Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query parameters normalization in BasicURLNormalizer #309

Closed

Conversation

aecio
Copy link
Contributor

@aecio aecio commented Jan 4, 2021

Related to #308.

Copy link
Member

@Chaiavi Chaiavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Format / Parse are good for method names but are not descriptive enough so I don't understand what the method does unless I dive into the code (versus "isBlank" which is ok without javadoc)

@aecio aecio force-pushed the aecio/query-params-normalization branch from ab610bf to baa9f6f Compare January 4, 2021 18:23
Copy link
Contributor Author

@aecio aecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, @Chaiavi. I addressed your comments and tried to improve the comments and JavaDocs.

@jnioche
Copy link
Contributor

jnioche commented Jan 5, 2021

Nice one @aecio
Could you please add the copyright headers to the new classes + add a line to CHANGES.txt?
Sorry if this sounds a bit formal. I haven't had time to review in details yet

- Sort query parameters (fix crawler-commons#246)
- Allows to (optionally) remove common irrelevant query parameters
- Consistently encode query parameters with
'application/x-www-form-urlencoded'
@aecio aecio force-pushed the aecio/query-params-normalization branch from baa9f6f to 888c962 Compare January 5, 2021 18:27
@aecio
Copy link
Contributor Author

aecio commented Jan 5, 2021

Thanks, @jnioche. Just updated the files. I added different entries for the major changes, including issues #246 and #308.

Copy link
Contributor

@sebastian-nagel sebastian-nagel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @aecio. The PR includes a lot of useful normalization steps. I know it isn't easy to balance between strict normalization (eg. to avoid duplicates) and the preservation of potential semantically different URLs.

@@ -124,7 +124,7 @@ http://foo.com/aa/bb//foo.html, http://foo.com/aa/bb/foo.html
http://foo.com//aa//bb//foo.html, http://foo.com/aa/bb/foo.html
http://foo.com////aa////bb//foo.html, http://foo.com/aa/bb/foo.html
http://foo.com////aa////bb////foo.html, http://foo.com/aa/bb/foo.html
http://foo.com/aa?referer=http://bar.com, http://foo.com/aa?referer=http://bar.com
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it desired to obligatorily percent-encode colon : and slash / in query parameters?

  • RFC3986 allows both characters in the query part (: belongs to pchar)
  • the whatwg URL spec only lists as "query-percent-encode-set": U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>)U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>) plus all control (U+0000 - U+001F) and non-ASCII chars (codepoint > U+007F)
  • browsers (I tried the current version of Chrome and Firefox) leave :/ unchanged in the query part

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, if it is assumed that a query is always filled from an HTML form, the primary representation is the escaped one, although it's the longer one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, just to be clear, your suggestion is to normalize to the shorter version (http://foo.com/aa?referer=http%3A%2F%2Fbar.com > http://foo.com/aa?referer=http://bar.com), i.e., encode only the chars in the query-percent-encode-set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working on this change and found the following existing test cases which seem to be inconsistent:

  1. http://x.com/show?http%3A%2F%2Fx.com%2Fb => http://x.com/show?http%3A%2F%2Fx.com%2Fb
  2. http://foo.com/aa?referer=http://bar.com => http://foo.com/aa?referer=http://bar.com

Even though the first case is malformed with an incomplete query name-value pair, the string http%3A%2F%2Fx.com%2Fb should likely be interpreted as a query name. If we are not encoding ':' and '/', then shouldn't the correct normalization be:
http://x.com/show?http%3A%2F%2Fx.com%2Fb => http://x.com/show?http://x.com/b

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @aecio, the query part of a URL is not necessarily filled from a form. This means it can contain any arbitrary query string which isn't a key-value structure and with a encoding not (fully) consistent with application/x-www-form-urlencoded. Just a few examples:

Of course, most often the query string stems from a HTML form. However, it's also important not to break valid URLs. That's what I mean with finding the "balance between strict normalization (eg. to avoid duplicates) and the preservation of potential semantically different URLs".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand we should not change the URL when there is potential for breaking it. Your last example shows an interesting edge case where using a standard decoding could create problems. What is still not clear to me is what is the minimal set of normalizations that are safe to apply to the query data.

  • Should we not change the data query data at all?
  • When should we sort name-value pairs? When doing so, should we keep data encoding intact?
  • Is there any case we want to change the data encoding (maybe only when there are illegal characters)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the minimal set of normalizations that are safe to apply to the query data
Is there any case we want to change the data encoding...

See RFC3986 and the whatwg URL spec mentioned above. There's an interesting discussion in whatwg/url#369 and a nice JavaScript function to test how browsers normalize URLs.

Should we not change the data query data at all?

What about making this optional (maybe on by default)? Your PR already adds a constructor to configure the removal of parameters by name. Any good ideas how to make the URL normalization configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the URL normalization more configurable, we can change the constructor to a builder object that contains all configurable options. This would allow adding many options without having a complex constructor and while still keeping a simple fluent API with reasonable defaults options.

I think that for this case, the options could be the following?

urlEncodeQueryParameters = NEVER | ALWAYS | KEY_VALUE_PAIRS_ONLY

If so, what would be the desired default? KEY_VALUE_PAIRS_ONLY?

src/test/resources/normalizer/weirdToNormalizedUrls.csv Outdated Show resolved Hide resolved
Copy link
Contributor Author

@aecio aecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments and the links, @sebastian-nagel.

@@ -124,7 +124,7 @@ http://foo.com/aa/bb//foo.html, http://foo.com/aa/bb/foo.html
http://foo.com//aa//bb//foo.html, http://foo.com/aa/bb/foo.html
http://foo.com////aa////bb//foo.html, http://foo.com/aa/bb/foo.html
http://foo.com////aa////bb////foo.html, http://foo.com/aa/bb/foo.html
http://foo.com/aa?referer=http://bar.com, http://foo.com/aa?referer=http://bar.com
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, just to be clear, your suggestion is to normalize to the shorter version (http://foo.com/aa?referer=http%3A%2F%2Fbar.com > http://foo.com/aa?referer=http://bar.com), i.e., encode only the chars in the query-percent-encode-set?

src/test/resources/normalizer/weirdToNormalizedUrls.csv Outdated Show resolved Hide resolved
@Chaiavi Chaiavi added this to the 1.2 milestone Aug 11, 2021
@Chaiavi Chaiavi added the normalizer Issues concerning our URL normalizer label Aug 17, 2021
@Chaiavi
Copy link
Member

Chaiavi commented Aug 17, 2021

Lots of work has gone into this

Where do we stand here ?
Can anyone summarize what is still missing (@aecio , @sebastian-nagel) ?

@aecio
Copy link
Contributor Author

aecio commented Aug 18, 2021

@Chaiavi As I recall, this PR is doing more normalization than browsers typically do (and than some users would like by default). I was doing this on free time and eventually got too busy, but I'd be happy to continue when I get the chance if folks think this PR is going in the right direction.

@sebastian-nagel
Copy link
Contributor

doing more normalization than browsers typically do

I would also opt to keep the defaults on the safe side and let users actively toggle on the more aggressive options.

this PR is going in the right direction

But otherwise yes, it's the right direction.

Copy link
Contributor

@sebastian-nagel sebastian-nagel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Looks good. I'd opt to merge this PR now and opened #321 to address the configuration of the BasicURLNormalizer by providing a builder. Thanks, @aecio!

@aecio
Copy link
Contributor Author

aecio commented Sep 20, 2021

Thanks, for reviewing @sebastian-nagel. I still have some uncommitted changes where I was trying to follow the URL specs (RFC3986 and whatwg URL) more closely during URL encoding to avoid unsafe normalizations (which turns out to be a bit tricky). Before merging this: what about I reduce the scope of this PR and split it into two: one with query parameter sorting and removing common irrelevant query parameters, which could be merged now; and another PR with the unsafe URL encoding normalizations, which can be merged later (maybe after the builder is introduced)?

@sebastian-nagel
Copy link
Contributor

Yes, I definitely agree: if there is still work to do, better split this PR into smaller work packages. Easier to review, discuss, and less danger that complex PRs lie around until they get finally merged. So, I'll wait for you updates on this PR. Let us know if you want to work on #321, otherwise I'd try to open a PR.

@aecio
Copy link
Contributor Author

aecio commented Sep 20, 2021

Ok, I cleaned up all the extra URL encoding code and use the existing escape/unescape methods instead. Also, added some test cases that cover URLs that have data in the query that should not be encoded. I'll try to open a new PR later only with the URL encoding part.

I'd be happy to work on #321 as well, but won't be able to do it until the weekend or maybe next week. So feel to do it if you'd rather get it done sooner.

@sebastian-nagel
Copy link
Contributor

Thanks, @aecio!

Rebased, squashed and fixed the sitemaps unit tests using the URL normalizer and now failing on sorted query params.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
normalizer Issues concerning our URL normalizer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BasicNormalizer] sorting the Query Parameters
4 participants