[BasicNormalizer] Provide builder to configure the normalizer #321

Closed
sebastian-nagel opened this issue Sep 20, 2021 · 4 comments · Fixed by #324
@sebastian-nagel
Contributor

(suggested by @aecio in #309)

With #246/#309, the class BasicURLNormalizer adds a normalization routine which is not safe in rare cases. It should be possible to enable this optimization only on demand, by providing a builder. This would also make it possible to configure other options (e.g. the set of query parameters to be removed) while keeping the interface clean otherwise (no constructors to pass configurations in).
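To illustrate the proposal, here is a minimal sketch of the builder pattern for a normalizer. `SketchNormalizer`, `optionalStep` and `filterQuery` are hypothetical names made up for this example and are not the crawler-commons API; the opt-in flag only stands in for the routine from #246/#309, whose behavior is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for illustration; not the crawler-commons class. */
final class SketchNormalizer {
    private final Set<String> queryParamsToRemove;
    private final boolean optionalStepEnabled;

    // Private constructor: instances are only created through the builder,
    // so no configuration-carrying constructors leak into the interface.
    private SketchNormalizer(Builder b) {
        this.queryParamsToRemove =
            Collections.unmodifiableSet(new HashSet<>(b.queryParamsToRemove));
        this.optionalStepEnabled = b.optionalStepEnabled;
    }

    static Builder newBuilder() {
        return new Builder();
    }

    boolean isOptionalStepEnabled() {
        return optionalStepEnabled;
    }

    /** Drops every query parameter whose name is in the configured set. */
    String filterQuery(String query) {
        List<String> kept = new ArrayList<>();
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = (eq >= 0 ? pair.substring(0, eq) : pair)
                .toLowerCase(Locale.ROOT);
            if (!queryParamsToRemove.contains(name)) {
                kept.add(pair);
            }
        }
        return String.join("&", kept);
    }

    static final class Builder {
        private Set<String> queryParamsToRemove = new HashSet<>();
        private boolean optionalStepEnabled = false; // off unless requested

        Builder queryParamsToRemove(Collection<String> names) {
            Set<String> lower = new HashSet<>();
            for (String n : names) {
                lower.add(n.toLowerCase(Locale.ROOT));
            }
            this.queryParamsToRemove = lower;
            return this;
        }

        // Hypothetical opt-in switch for a step that is unsafe in rare cases.
        Builder optionalStep(boolean enabled) {
            this.optionalStepEnabled = enabled;
            return this;
        }

        SketchNormalizer build() {
            return new SketchNormalizer(this);
        }
    }

    public static void main(String[] args) {
        SketchNormalizer n = SketchNormalizer.newBuilder()
            .queryParamsToRemove(Arrays.asList("sid", "phpsessid"))
            .build();
        System.out.println(n.filterQuery("a=1&sid=abc&b=2")); // prints a=1&b=2
    }
}
```

Since the unsafe step defaults to off, callers who never touch the builder option keep the conservative behavior.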

@sebastian-nagel sebastian-nagel added enhancement normalizer Issues concerning our URL normalizer labels Sep 20, 2021
@sebastian-nagel sebastian-nagel added this to the 1.2 milestone Sep 20, 2021
@sebastian-nagel
Contributor Author

Also, whether internationalized domain names (IDNs) are converted to ASCII/Punycode (as is done currently) or to Unicode could be made configurable.

@aecio
Contributor

aecio commented Oct 4, 2021

@sebastian-nagel Let me know if you have started working on this. If you haven't, I can submit a PR for this tomorrow.

@sebastian-nagel
Contributor Author

@aecio: not yet, feel free to start. Thanks!

aecio added a commit to aecio/crawler-commons that referenced this issue Oct 4, 2021
Usage example:
```
normalizer = BasicURLNormalizer.newBuilder()
  .idnNormalization(IdnNormalization.PUNYCODE)
  .queryParamsToRemove(
    asList("sid", "phpsessid", "sessionid", "jsessionid")
  )
  .build();
```

Closes crawler-commons#321.
@aecio
Contributor

aecio commented Oct 4, 2021

@sebastian-nagel done in #324. I haven't added the ability to convert IDNs to Unicode yet, as it seems a bit trickier (besides converting non-ASCII to Unicode, I think we would also need to detect Punycode and convert it to Unicode?).
I only added the ability to disable IDN normalization, but feel free to implement the Unicode normalization or remove this configuration if you feel it is not really useful.
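On the question of detecting Punycode: as a rough sketch, the JDK's `java.net.IDN` (which implements IDNA 2003, RFC 3490) already decodes `xn--` labels in `toUnicode` and returns labels without that prefix unchanged, so no separate detection step may be needed. The class and helper names below (`IdnSketch`, `toUnicodeHost`, `toAsciiHost`) are made up for this example, and nothing here is meant as a description of how #324 or BasicURLNormalizer handles it.

```java
import java.net.IDN;

/** Hypothetical helper for illustration; not part of crawler-commons. */
final class IdnSketch {

    /** Converts any Punycode (xn--) labels of the host to Unicode. */
    static String toUnicodeHost(String host) {
        // Labels that do not start with the ACE prefix are returned as-is.
        return IDN.toUnicode(host);
    }

    /** Converts non-ASCII labels of the host to their Punycode form. */
    static String toAsciiHost(String host) {
        // Throws IllegalArgumentException if the host is not a valid IDN.
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // Punycode input is decoded.
        System.out.println(toUnicodeHost("xn--mnchen-3ya.de"));
        // Input that is already Unicode passes through unchanged.
        System.out.println(toUnicodeHost("m\u00fcnchen.de"));
        // The reverse direction, matching the current Punycode behavior.
        System.out.println(toAsciiHost("m\u00fcnchen.de"));
    }
}
```

One caveat with this approach: `IDN.toUnicode` never throws and hands back the input unmodified on errors, so a malformed `xn--` label would pass through silently.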
