[Sitemaps] Trim Unicode whitespace around URLs, fixes #224 #228

sebastian-nagel

Trim all Unicode white space from URLs in sitemaps, sitemap indexes and feeds.

* Check whether character is any Unicode whitespace, including the space
* characters not covered by {@link Character#isWhitespace(char)}
*/
public static boolean isWhitespace(char c) {

Weren't there other places in the code calling Character.isWhitespace() that we'd want to change to use this routine? E.g., there was the other PR where the text being collected is checked to see if it's all whitespace. And if so, maybe it should go into a different class (not sure what string/char-ish utils we've got already).

@kkrugler kkrugler left a comment

Wondering about more general use of the whitespace routine.

@sebastian-nagel sebastian-nagel force-pushed the cc-224-sitemaps-trim-unicode-whitespace branch from 8025cc0 to 67db8bf Compare February 20, 2019 15:28
@sebastian-nagel

Hi @kkrugler, I've updated the PR, rebased it onto current master, and added a changelog entry. All occurrences of Character.isWhitespace(char ch) are now replaced by a call to isWhitespace(char c). This applies to all code of the sitemap parser. As far as I can see, all other packages (e.g., the robots.txt parser) only support ASCII whitespace.
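
For readers following along, here is a minimal sketch of what such a check and a matching trim could look like. It is not the merged code: the class name `UnicodeWhitespaceSketch` and the `trim` helper are hypothetical, and the sketch assumes the characters "not covered" are the three no-break spaces (U+00A0, U+2007, U+202F) that `Character.isWhitespace(char)` is documented to exclude.

```java
// Illustrative sketch only; the class and the trim helper are hypothetical names.
final class UnicodeWhitespaceSketch {

    private UnicodeWhitespaceSketch() {
    }

    /**
     * Returns true for everything Character.isWhitespace(char) accepts,
     * plus the three no-break spaces that method explicitly excludes.
     */
    static boolean isWhitespace(char c) {
        return Character.isWhitespace(c)
                || c == '\u00a0'   // U+00A0 NO-BREAK SPACE
                || c == '\u2007'   // U+2007 FIGURE SPACE
                || c == '\u202f';  // U+202F NARROW NO-BREAK SPACE
    }

    /** Strips leading and trailing characters for which isWhitespace is true. */
    static String trim(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isWhitespace(s.charAt(start))) {
            start++;
        }
        while (end > start && isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }
}
```

A plain `String.trim()` only removes characters up to U+0020, so a URL padded with a no-break space would keep its padding; a helper along these lines removes it.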

@kkrugler kkrugler left a comment

lgtm

@sebastian-nagel sebastian-nagel merged commit 4d6b27c into crawler-commons:master Feb 20, 2019
@sebastian-nagel sebastian-nagel deleted the cc-224-sitemaps-trim-unicode-whitespace branch February 20, 2019 21:27