Query parameters normalization in BasicURLNormalizer #309

aecio · 2021-01-04T03:47:41Z

- Sort query parameters (fix [BasicNormalizer] sorting the Query Parameters #246)
- Optionally remove common irrelevant query parameters

Related to #308.
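
For readers skimming the thread, the two steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: the class name, the method name and the example parameter `utm_source` are hypothetical, and elements are sorted lexicographically as whole strings without touching their encoding.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the implementation contained in this PR.
final class QuerySortSketch {

    /** Sorts '&'-separated query elements and drops those whose name is in {@code ignored}. */
    static String normalizeQuery(String query, Set<String> ignored) {
        List<String> kept = new ArrayList<>();
        for (String element : query.split("&")) {
            if (element.isEmpty()) {
                continue; // skip empty elements such as in "a=1&&b=2"
            }
            int eq = element.indexOf('=');
            // An element without '=' is treated as a bare name.
            String name = eq >= 0 ? element.substring(0, eq) : element;
            if (!ignored.contains(name)) {
                kept.add(element); // keep the element verbatim, no re-encoding
            }
        }
        kept.sort(Comparator.naturalOrder());
        return String.join("&", kept);
    }

    public static void main(String[] args) {
        // "utm_source" is only an example of a parameter one might choose to drop.
        System.out.println(normalizeQuery("b=2&utm_source=x&a=1", Set.of("utm_source")));
    }
}
```

Keeping each element verbatim sidesteps the encoding questions discussed later in the review; only order and membership change.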

Chaiavi

Format and Parse are fine as method names, but they are not descriptive enough: I can't tell what each method does without diving into the code (unlike "isBlank", which is clear without Javadoc).

src/main/java/crawlercommons/filters/basic/BasicURLNormalizer.java

src/main/java/crawlercommons/utils/Strings.java

aecio

Thanks for the review, @Chaiavi. I addressed your comments and tried to improve the comments and JavaDocs.

src/main/java/crawlercommons/filters/basic/BasicURLNormalizer.java

jnioche · 2021-01-05T17:59:28Z

Nice one @aecio
Could you please add the copyright headers to the new classes + add a line to CHANGES.txt?
Sorry if this sounds a bit formal. I haven't had time to review in detail yet.

- Sort query parameters (fix crawler-commons#246)
- Allows to (optionally) remove common irrelevant query parameters
- Consistently encode query parameters with 'application/x-www-form-urlencoded'

aecio · 2021-01-05T18:38:43Z

Thanks, @jnioche. Just updated the files. I added different entries for the major changes, including issues #246 and #308.

sebastian-nagel

Thanks, @aecio. The PR includes a lot of useful normalization steps. I know it isn't easy to strike a balance between strict normalization (e.g. to avoid duplicates) and the preservation of potentially semantically different URLs.

sebastian-nagel · 2021-01-12T13:52:18Z

src/test/resources/normalizer/weirdToNormalizedUrls.csv

@@ -124,7 +124,7 @@ http://foo.com/aa/bb//foo.html, http://foo.com/aa/bb/foo.html
 http://foo.com//aa//bb//foo.html, http://foo.com/aa/bb/foo.html
 http://foo.com////aa////bb//foo.html, http://foo.com/aa/bb/foo.html
 http://foo.com////aa////bb////foo.html, http://foo.com/aa/bb/foo.html
-http://foo.com/aa?referer=http://bar.com, http://foo.com/aa?referer=http://bar.com


Is it desired to obligatorily percent-encode colon : and slash / in query parameters?

- RFC3986 allows both characters in the query part (: belongs to pchar)
- the whatwg URL spec only lists as "query-percent-encode-set": U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>) plus all control (U+0000 - U+001F) and non-ASCII chars (codepoint > U+007F)
- browsers (I tried the current version of Chrome and Firefox) leave :/ unchanged in the query part

Of course, if it is assumed that a query is always filled from an HTML form, the primary representation is the escaped one, although it's the longer one.
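
To make the difference between the two encodings concrete, here is a small sketch. It assumes Java 10 or later for `URLEncoder.encode(String, Charset)`; the class and method names are hypothetical, and `encodeQueryMinimal` is only an approximation of the whatwg query percent-encode set written for this illustration, not code from crawler-commons.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; class and method names are hypothetical.
final class QueryEncodingSketch {

    /** Form encoding (application/x-www-form-urlencoded) as implemented by the JDK. */
    static String formEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /**
     * Percent-encodes only bytes in the whatwg query percent-encode set:
     * controls, space, '"', '#', '<', '>', and everything above U+007E.
     * Existing '%' escapes are left untouched.
     */
    static String encodeQueryMinimal(String query) {
        StringBuilder sb = new StringBuilder();
        for (byte b : query.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean mustEncode = c <= 0x20 || c >= 0x7F
                    || c == '"' || c == '#' || c == '<' || c == '>';
            if (mustEncode) {
                sb.append(String.format("%%%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(formEncode("http://bar.com"));                 // http%3A%2F%2Fbar.com
        System.out.println(encodeQueryMinimal("referer=http://bar.com")); // referer=http://bar.com
    }
}
```

The form encoder escapes `:` and `/`, while the minimal variant leaves them alone, which matches the browser behaviour described above.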

So, just to be clear, your suggestion is to normalize to the shorter version (http://foo.com/aa?referer=http%3A%2F%2Fbar.com > http://foo.com/aa?referer=http://bar.com), i.e., encode only the chars in the query-percent-encode-set?

I was working on this change and found the following existing test cases which seem to be inconsistent:

http://x.com/show?http%3A%2F%2Fx.com%2Fb => http://x.com/show?http%3A%2F%2Fx.com%2Fb

http://foo.com/aa?referer=http://bar.com => http://foo.com/aa?referer=http://bar.com

Even though the first case is malformed with an incomplete query name-value pair, the string http%3A%2F%2Fx.com%2Fb should likely be interpreted as a query name. If we are not encoding ':' and '/', then shouldn't the correct normalization be:
http://x.com/show?http%3A%2F%2Fx.com%2Fb => http://x.com/show?http://x.com/b

Hi @aecio, the query part of a URL is not necessarily filled from a form. This means it can contain an arbitrary query string that isn't a key-value structure and whose encoding is not (fully) consistent with application/x-www-form-urlencoded. Just a few examples:

- http://etab.ac-poitiers.fr/lepmo/spip.php?article247
- https://norma-jackson-fine-art.com/gallery/index.php?/category/52
- https://cmsimplexh.webdesign-keil.de/?Fachthemen/Ein_CMSimple_XH_Template_erstellen/Idee_%2F_Planung - here / and %2F have different semantics: replacing one with the other causes a page not found

Of course, most often the query string stems from an HTML form. However, it's also important not to break valid URLs. That's what I mean by finding the "balance between strict normalization (eg. to avoid duplicates) and the preservation of potential semantically different URLs".
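
The third example can be reproduced with a short sketch. It assumes Java 10 or later and uses a hypothetical helper, `decodeThenEncode`, that models a naive "decode, then form-encode" normalization; it is not the PR's code. After the round trip, the escaped and the literal slash can no longer be told apart.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names are hypothetical.
final class DecodeRoundTripSketch {

    /** Naive normalization: decode the query part, then form-encode it again. */
    static String decodeThenEncode(String queryPart) {
        String decoded = URLDecoder.decode(queryPart, StandardCharsets.UTF_8);
        return URLEncoder.encode(decoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String escaped = "Idee_%2F_Planung"; // slash escaped by the site
        String literal = "Idee_/_Planung";   // slash used as a separator
        // Both print the same string, so the distinction is lost.
        System.out.println(decodeThenEncode(escaped));
        System.out.println(decodeThenEncode(literal));
    }
}
```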

I understand we should not change the URL when there is potential for breaking it. Your last example shows an interesting edge case where using a standard decoding could create problems. What is still not clear to me is what is the minimal set of normalizations that are safe to apply to the query data.

Should we not change the query data at all?

When should we sort name-value pairs? When doing so, should we keep data encoding intact?

Is there any case we want to change the data encoding (maybe only when there are illegal characters)?

> the minimal set of normalizations that are safe to apply to the query data
> Is there any case we want to change the data encoding...

See RFC3986 and the whatwg URL spec mentioned above. There's an interesting discussion in whatwg/url#369 and a nice JavaScript function to test how browsers normalize URLs.

> Should we not change the query data at all?

What about making this optional (maybe on by default)? Your PR already adds a constructor to configure the removal of parameters by name. Any good ideas how to make the URL normalization configurable?

To make the URL normalization more configurable, we could replace the constructor with a builder object that holds all configurable options. This would allow adding many options without a complex constructor, while still keeping a simple fluent API with reasonable default options.

I think that for this case, the options could be the following?

urlEncodeQueryParameters = NEVER | ALWAYS | KEY_VALUE_PAIRS_ONLY

If so, what would be the desired default? KEY_VALUE_PAIRS_ONLY?
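
A builder along these lines might look like the sketch below. Everything in it is hypothetical: the class, the method names and the `queryParametersToRemove` option are assumptions made for illustration, and the default shown is just the candidate floated above, not a decision.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a builder-style configuration; none of these names
// are taken from crawler-commons.
final class NormalizerConfigSketch {

    enum QueryEncoding { NEVER, ALWAYS, KEY_VALUE_PAIRS_ONLY }

    final QueryEncoding urlEncodeQueryParameters;
    final Set<String> queryParametersToRemove;

    private NormalizerConfigSketch(Builder builder) {
        this.urlEncodeQueryParameters = builder.urlEncodeQueryParameters;
        this.queryParametersToRemove =
                Collections.unmodifiableSet(new TreeSet<>(builder.queryParametersToRemove));
    }

    static Builder newBuilder() {
        return new Builder();
    }

    static final class Builder {
        // Placeholder default: the thread leaves the actual default open.
        private QueryEncoding urlEncodeQueryParameters = QueryEncoding.KEY_VALUE_PAIRS_ONLY;
        private Set<String> queryParametersToRemove = new TreeSet<>();

        Builder urlEncodeQueryParameters(QueryEncoding mode) {
            this.urlEncodeQueryParameters = mode;
            return this;
        }

        Builder queryParametersToRemove(Set<String> names) {
            this.queryParametersToRemove = new TreeSet<>(names);
            return this;
        }

        NormalizerConfigSketch build() {
            return new NormalizerConfigSketch(this);
        }
    }

    public static void main(String[] args) {
        NormalizerConfigSketch config = NormalizerConfigSketch.newBuilder()
                .urlEncodeQueryParameters(QueryEncoding.NEVER)
                .queryParametersToRemove(Set.of("utm_source"))
                .build();
        System.out.println(config.urlEncodeQueryParameters + " " + config.queryParametersToRemove);
    }
}
```

New options can then be added as further builder methods without changing existing call sites.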

src/test/resources/normalizer/weirdToNormalizedUrls.csv

aecio

Thanks for the comments and the links, @sebastian-nagel.

aecio · 2021-01-12T15:41:15Z

Chaiavi · 2021-08-17T08:13:03Z

Lots of work has gone into this

Where do we stand here?
Can anyone summarize what is still missing (@aecio, @sebastian-nagel)?

aecio · 2021-08-18T15:23:44Z

@Chaiavi As I recall, this PR is doing more normalization than browsers typically do (and than some users would like by default). I was doing this in my free time and eventually got too busy, but I'd be happy to continue when I get the chance if folks think this PR is going in the right direction.

sebastian-nagel · 2021-08-19T07:31:40Z

> doing more normalization than browsers typically do

I would also opt to keep the defaults on the safe side and let users actively toggle on the more aggressive options.

> this PR is going in the right direction

But otherwise yes, it's the right direction.

sebastian-nagel

+1 Looks good. I'd opt to merge this PR now and opened #321 to address the configuration of the BasicURLNormalizer by providing a builder. Thanks, @aecio!

aecio · 2021-09-20T13:15:41Z

Thanks for reviewing, @sebastian-nagel. I still have some uncommitted changes where I was trying to follow the URL specs (RFC3986 and whatwg URL) more closely during URL encoding to avoid unsafe normalizations (which turns out to be a bit tricky). Before merging this: how about I reduce the scope of this PR and split it into two? One with query parameter sorting and removal of common irrelevant query parameters, which could be merged now, and another PR with the unsafe URL encoding normalizations, which can be merged later (maybe after the builder is introduced).

sebastian-nagel · 2021-09-20T13:48:21Z

Yes, I definitely agree: if there is still work to do, better split this PR into smaller work packages. They are easier to review and discuss, and there is less danger that complex PRs lie around until they finally get merged. So, I'll wait for your updates on this PR. Let us know if you want to work on #321, otherwise I'd try to open a PR.

aecio · 2021-09-20T15:39:45Z

Ok, I cleaned up all the extra URL encoding code and now use the existing escape/unescape methods instead. I also added some test cases that cover URLs that have data in the query that should not be encoded. I'll try to open a new PR later with only the URL encoding part.

I'd be happy to work on #321 as well, but won't be able to do it until the weekend or maybe next week. So feel free to do it if you'd rather get it done sooner.

sebastian-nagel · 2021-09-21T11:32:46Z

Thanks, @aecio!

Rebased, squashed, and fixed the sitemaps unit tests that use the URL normalizer and were failing on sorted query params.

Chaiavi reviewed Jan 4, 2021

aecio force-pushed the aecio/query-params-normalization branch from ab610bf to baa9f6f Compare January 4, 2021 18:23

aecio commented Jan 4, 2021

Query parameters normalization

888c962

aecio force-pushed the aecio/query-params-normalization branch from baa9f6f to 888c962 Compare January 5, 2021 18:27

sebastian-nagel reviewed Jan 12, 2021

aecio commented Jan 12, 2021

Chaiavi added this to the 1.2 milestone Aug 11, 2021

Chaiavi added the normalizer label (Issues concerning our URL normalizer) Aug 17, 2021

Chaiavi requested a review from sebastian-nagel August 18, 2021 15:27

Do not split query parameters on semicolon

648906d

sebastian-nagel mentioned this pull request Sep 20, 2021

[BasicNormalizer] Provide builder to configure the normalizer #321

Closed

sebastian-nagel approved these changes Sep 20, 2021

Remove query parameters encoding

115cc59

sebastian-nagel closed this in a10cf25 Sep 21, 2021

This was referenced Jun 13, 2023

[Robots.txt] Add units test based on examples in RFC 9309 #420

Merged

[BasicNormalizer] Query parameters normalization in BasicURLNormalizer #308

Closed
