-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Query parameters normalization in BasicURLNormalizer #309
Query parameters normalization in BasicURLNormalizer #309
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Format / Parse are good for method names but are not descriptive enough so I don't understand what the method does unless I dive into the code (versus "isBlank" which is ok without javadoc)
src/main/java/crawlercommons/filters/basic/BasicURLNormalizer.java
Outdated
Show resolved
Hide resolved
ab610bf
to
baa9f6f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, @Chaiavi. I addressed your comments and tried to improve the comments and JavaDocs.
Nice one @aecio |
- Sort query parameters (fix crawler-commons#246) - Allows to (optionally) remove common irrelevant query parameters - Consistently encode query parameters with 'application/x-www-form-urlencoded'
baa9f6f
to
888c962
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @aecio. The PR includes a lot of useful normalization steps. I know it isn't easy to balance between strict normalization (eg. to avoid duplicates) and the preservation of potential semantically different URLs.
@@ -124,7 +124,7 @@ http://foo.com/aa/bb//foo.html, http://foo.com/aa/bb/foo.html | |||
http://foo.com//aa//bb//foo.html, http://foo.com/aa/bb/foo.html | |||
http://foo.com////aa////bb//foo.html, http://foo.com/aa/bb/foo.html | |||
http://foo.com////aa////bb////foo.html, http://foo.com/aa/bb/foo.html | |||
http://foo.com/aa?referer=http://bar.com, http://foo.com/aa?referer=http://bar.com |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it desired to obligatorily percent-encode colon :
and slash /
in query parameters?
- RFC3986 allows both characters in the query part (
:
belongs topchar
) - the whatwg URL spec only lists as "query-percent-encode-set": U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>)U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>) plus all control (U+0000 - U+001F) and non-ASCII chars (codepoint > U+007F)
- browsers (I tried the current version of Chrome and Firefox) leave
:/
unchanged in the query part
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, if it is assumed that a query is always filled from an HTML form, the primary representation is the escaped one, although it's the longer one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, just to be clear, your suggestion is to normalize to the shorter version (http://foo.com/aa?referer=http%3A%2F%2Fbar.com
> http://foo.com/aa?referer=http://bar.com
), i.e., encode only the chars in the query-percent-encode-set
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was working on this change and found the following existing test cases which seem to be inconsistent:
http://x.com/show?http%3A%2F%2Fx.com%2Fb
=>http://x.com/show?http%3A%2F%2Fx.com%2Fb
http://foo.com/aa?referer=http://bar.com
=>http://foo.com/aa?referer=http://bar.com
Even though the first case is malformed with an incomplete query name-value pair, the string http%3A%2F%2Fx.com%2Fb
should likely be interpreted as a query name. If we are not encoding ':' and '/', then shouldn't the correct normalization be:
http://x.com/show?http%3A%2F%2Fx.com%2Fb
=> http://x.com/show?http://x.com/b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @aecio, the query part of a URL is not necessarily filled from a form. This means it can contain any arbitrary query string which isn't a key-value structure and with a encoding not (fully) consistent with application/x-www-form-urlencoded
. Just a few examples:
- http://etab.ac-poitiers.fr/lepmo/spip.php?article247
- https://norma-jackson-fine-art.com/gallery/index.php?/category/52
- https://cmsimplexh.webdesign-keil.de/?Fachthemen/Ein_CMSimple_XH_Template_erstellen/Idee_%2F_Planung - here
/
and%2F
have a different semantic: replacing them causes a page not found
Of course, most often the query string stems from a HTML form. However, it's also important not to break valid URLs. That's what I mean with finding the "balance between strict normalization (eg. to avoid duplicates) and the preservation of potential semantically different URLs".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand we should not change the URL when there is potential for breaking it. Your last example shows an interesting edge case where using a standard decoding could create problems. What is still not clear to me is what is the minimal set of normalizations that are safe to apply to the query data.
- Should we not change the data query data at all?
- When should we sort name-value pairs? When doing so, should we keep data encoding intact?
- Is there any case we want to change the data encoding (maybe only when there are illegal characters)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the minimal set of normalizations that are safe to apply to the query data
Is there any case we want to change the data encoding...
See RFC3986 and the whatwg URL spec mentioned above. There's an interesting discussion in whatwg/url#369 and a nice JavaScript function to test how browsers normalize URLs.
Should we not change the data query data at all?
What about making this optional (maybe on by default)? Your PR already adds a constructor to configure the removal of parameters by name. Any good ideas how to make the URL normalization configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the URL normalization more configurable, we can change the constructor to a builder object that contains all configurable options. This would allow adding many options without having a complex constructor and while still keeping a simple fluent API with reasonable defaults options.
I think that for this case, the options could be the following?
urlEncodeQueryParameters = NEVER | ALWAYS | KEY_VALUE_PAIRS_ONLY
If so, what would be the desired default? KEY_VALUE_PAIRS_ONLY
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments and the links, @sebastian-nagel.
@@ -124,7 +124,7 @@ http://foo.com/aa/bb//foo.html, http://foo.com/aa/bb/foo.html | |||
http://foo.com//aa//bb//foo.html, http://foo.com/aa/bb/foo.html | |||
http://foo.com////aa////bb//foo.html, http://foo.com/aa/bb/foo.html | |||
http://foo.com////aa////bb////foo.html, http://foo.com/aa/bb/foo.html | |||
http://foo.com/aa?referer=http://bar.com, http://foo.com/aa?referer=http://bar.com |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, just to be clear, your suggestion is to normalize to the shorter version (http://foo.com/aa?referer=http%3A%2F%2Fbar.com
> http://foo.com/aa?referer=http://bar.com
), i.e., encode only the chars in the query-percent-encode-set
?
Lots of work has gone into this Where do we stand here ? |
@Chaiavi As I recall, this PR is doing more normalization than browsers typically do (and than some users would like by default). I was doing this on free time and eventually got too busy, but I'd be happy to continue when I get the chance if folks think this PR is going in the right direction. |
I would also opt to keep the defaults on the safe side and let users actively toggle on the more aggressive options.
But otherwise yes, it's the right direction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, for reviewing @sebastian-nagel. I still have some uncommitted changes where I was trying to follow the URL specs (RFC3986 and whatwg URL) more closely during URL encoding to avoid unsafe normalizations (which turns out to be a bit tricky). Before merging this: what about I reduce the scope of this PR and split it into two: one with query parameter sorting and removing common irrelevant query parameters, which could be merged now; and another PR with the unsafe URL encoding normalizations, which can be merged later (maybe after the builder is introduced)? |
Yes, I definitely agree: if there is still work to do, better split this PR into smaller work packages. Easier to review, discuss, and less danger that complex PRs lie around until they get finally merged. So, I'll wait for you updates on this PR. Let us know if you want to work on #321, otherwise I'd try to open a PR. |
Ok, I cleaned up all the extra URL encoding code and use the existing escape/unescape methods instead. Also, added some test cases that cover URLs that have data in the query that should not be encoded. I'll try to open a new PR later only with the URL encoding part. I'd be happy to work on #321 as well, but won't be able to do it until the weekend or maybe next week. So feel to do it if you'd rather get it done sooner. |
Thanks, @aecio! Rebased, squashed and fixed the sitemaps unit tests using the URL normalizer and now failing on sorted query params. |
Related to #308.