[BasicNormalizer] Provide builder to configure the normalizer #321
Comments
Whether internationalized domain names (IDNs) are converted to ASCII/Punycode (the current behavior) or to Unicode could also be made configurable.
@sebastian-nagel Let me know if you have started working on this. If you haven't, I can submit a PR for this tomorrow.
@aecio: not yet, feel free to start. Thanks!
Usage example:
```
normalizer = BasicURLNormalizer.newBuilder()
    .idnNormalization(IdnNormalization.PUNYCODE)
    .queryParamsToRemove(
        asList("sid", "phpsessid", "sessionid", "jsessionid")
    )
    .build();
```
Closes crawler-commons#321.
@sebastian-nagel done in #324. I haven't added the ability to convert IDNs to Unicode yet, as it seems a bit trickier (besides leaving non-ASCII host names in Unicode, I think we would also need to detect Punycode and convert it to Unicode?).
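For reference, a minimal sketch of both conversion directions using the JDK's `java.net.IDN` class. The names `IdnSketch`, `HostForm` and `normalizeHost` are hypothetical and are not part of crawler-commons. As far as I can tell, `IDN.toUnicode` already decodes labels carrying the `xn--` prefix and returns other labels unchanged, so it may cover the Punycode detection mentioned above, though edge cases (mixed-case labels, invalid Punycode) would still need tests.

```java
import java.net.IDN;

/** Hypothetical sketch, not the crawler-commons implementation. */
final class IdnSketch {

    /** Assumed option type: the target form of the host name. */
    enum HostForm { PUNYCODE, UNICODE }

    /**
     * Converts a host name to the requested form.
     * IDN.toASCII encodes non-ASCII labels as Punycode ("xn--..."),
     * IDN.toUnicode decodes "xn--" labels and leaves other labels as they are.
     */
    static String normalizeHost(String host, HostForm form) {
        return form == HostForm.PUNYCODE ? IDN.toASCII(host) : IDN.toUnicode(host);
    }

    public static void main(String[] args) {
        String unicodeHost = "b\u00fccher.example"; // "bücher.example"
        String asciiHost = normalizeHost(unicodeHost, HostForm.PUNYCODE);
        System.out.println(asciiHost);
        System.out.println(normalizeHost(asciiHost, HostForm.UNICODE).equals(unicodeHost));
    }
}
```

Note that `IDN.toASCII` throws `IllegalArgumentException` for input that cannot be converted, so a normalizer would have to decide whether to keep the original host or reject the URL in that case.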
(suggested by @aecio in #309)
With #246/#309 the class BasicURLNormalizer adds a normalization routine which is not safe in rare cases. It should be possible to enable this optimization only on demand, by providing a builder. A builder would also make it possible to configure other options (e.g. the set of query params to be removed) while otherwise keeping the interface clean (no constructors to pass configurations in).
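To illustrate the idea, here is a rough sketch of a builder that carries one option, the set of query parameters to remove. The class `SketchNormalizer` and its method `filterQuery` are hypothetical and only stand in for the real normalizer; the names `newBuilder`, `queryParamsToRemove` and `build` follow the usage example in this thread. The default (remove nothing) is an assumption made for the sketch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch of a builder-configured normalizer. */
final class SketchNormalizer {

    private final Set<String> queryParamsToRemove;

    // Private constructor: instances are only created through the builder.
    private SketchNormalizer(Builder builder) {
        this.queryParamsToRemove =
                Collections.unmodifiableSet(new LinkedHashSet<>(builder.queryParamsToRemove));
    }

    static Builder newBuilder() {
        return new Builder();
    }

    /** Drops the configured parameters from a raw query string such as "a=1&sid=2". */
    String filterQuery(String query) {
        return Arrays.stream(query.split("&"))
                .filter(pair -> !pair.isEmpty())
                .filter(pair -> !queryParamsToRemove.contains(paramName(pair)))
                .collect(Collectors.joining("&"));
    }

    // Parameter name is the part before '=', compared case-insensitively.
    private static String paramName(String pair) {
        int eq = pair.indexOf('=');
        String name = eq < 0 ? pair : pair.substring(0, eq);
        return name.toLowerCase(Locale.ROOT);
    }

    static final class Builder {
        // Assumed default for this sketch: no parameters are removed.
        private Set<String> queryParamsToRemove = new LinkedHashSet<>();

        Builder queryParamsToRemove(Collection<String> names) {
            Set<String> lowered = new LinkedHashSet<>();
            for (String name : names) {
                lowered.add(name.toLowerCase(Locale.ROOT));
            }
            this.queryParamsToRemove = lowered;
            return this;
        }

        SketchNormalizer build() {
            return new SketchNormalizer(this);
        }
    }
}
```

Because every option has a default in the builder, new options (such as a flag for the unsafe normalization step) could be added later without changing existing callers.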