[Sitemaps] Trim Unicode whitespace around URLs #224

Closed
moviewang opened this issue Dec 11, 2018 · 9 comments

Comments

@moviewang

moviewang commented Dec 11, 2018

image

loc.toString().trim() didn't trim all whitespace

@moviewang moviewang reopened this Dec 11, 2018
@sebastian-nagel
Contributor

Hi @moviewang, can you share a minimal sitemap to reproduce your problem? It's ok if the URLs are masked but the white space should be there. Thanks!
After a first look: String.trim() only removes ASCII white space and control characters and does not remove all Unicode white space. I see no issues replacing the trim() with a method that also strips Unicode white space, similar to StringUtils.strip() from commons-lang.
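
For illustration, here is a minimal sketch of the difference described above. The class and helper names (`UnicodeTrimSketch`, `stripUnicode`, `isUnicodeSpace`) are hypothetical and are not taken from crawler-commons or commons-lang. Note that `Character.isWhitespace()` on its own excludes non-breaking spaces such as U+00A0, so the sketch also checks `Character.isSpaceChar()`.

```java
public class UnicodeTrimSketch {

    // True for anything Java treats as whitespace or as a Unicode space
    // separator. The isSpaceChar() check is what covers U+00A0 (no-break
    // space), which Character.isWhitespace() excludes.
    static boolean isUnicodeSpace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Hypothetical helper: strips leading and trailing Unicode white space.
    // Unlike String.trim(), it does not strip control characters such as NUL.
    static String stripUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isUnicodeSpace(cp)) {
                break;
            }
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isUnicodeSpace(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // A made-up URL wrapped in no-break spaces and ASCII spaces.
        String loc = "\u00a0 https://example.com/sitemap.xml \u00a0";
        // trim() stops at the first character above U+0020, so nothing is removed.
        System.out.println("[" + loc.trim() + "]");
        // The helper removes both kinds of white space.
        System.out.println("[" + stripUnicode(loc) + "]");
    }
}
```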

@moviewang
Author

Yes, the sitemap's loc elements contain white space, but I have no rights to modify it. Is there any other approach to solve the problem? Thanks!

@kkrugler
Contributor

Hi @moviewang - can you provide the URL to the sitemap?

@moviewang
Author

image
Hi @kkrugler - the sitemap index content looks like this.

@sebastian-nagel
Contributor

Hi @moviewang, is there any invisible Unicode white space (not in the ASCII range)? I've tried to reproduce it with a similar file:

  • the first URL is properly trimmed
    screenshot_20181212_112311
  • the second isn't because there is a non-breaking space (U+00a0) hidden
    screenshot_20181212_112415
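
One way to check a URL string for hidden white space like this is to list every non-ASCII space code point it contains. The sketch below is only a diagnostic aid. The class and method names are hypothetical, and the sample URL is made up.

```java
import java.util.ArrayList;
import java.util.List;

public class HiddenWhitespaceFinder {

    // Returns one entry per non-ASCII white space character found in s,
    // formatted as "U+XXXX at index N" (N is the char index in the string).
    static List<String> findNonAsciiSpaces(String s) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean isSpace = Character.isWhitespace(cp) || Character.isSpaceChar(cp);
            if (cp > 0x7f && isSpace) {
                found.add(String.format("U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp);
        }
        return found;
    }

    public static void main(String[] args) {
        // A no-break space hidden in front of an otherwise normal URL.
        String loc = "\u00a0https://example.com/sitemap1.xml";
        for (String hit : findNonAsciiSpaces(loc)) {
            System.out.println(hit);
        }
    }
}
```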

@kkrugler
Contributor

Hi @moviewang - I tried modifying one of the existing sitemap index parsing tests to show this problem, but it seems to parse the URL without any issues (assuming they look like what you showed above). I did see some other issues, like the individual sitemaps not having their "processed" flag set, and no last modified date, etc. but that's a different can of worms.

@kkrugler
Contributor

@moviewang - the dates in your example aren't valid for sitemaps; they need to follow one of these formats. So if the dates are in UTC, they should be something like 2018-12-12T02:36:56Z.

There are two other issues, which I'll file as bugs separately, but only one might impact you. Date strings aren't getting trimmed, so the whitespace around the date string could cause problems.
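
As a rough illustration of why untrimmed date strings matter, the sketch below uses `java.time` rather than the crawler-commons parser, and the class and method names are hypothetical. It only handles the full date-time form shown above, not the shorter date-only variants.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

public class LastModParseSketch {

    // Trims ASCII white space before parsing a full ISO-8601 date-time
    // such as 2018-12-12T02:36:56Z.
    static OffsetDateTime parseLastMod(String raw) {
        return OffsetDateTime.parse(raw.trim());
    }

    // Reports whether the raw string parses as-is, without any trimming.
    static boolean parsesWithoutTrim(String raw) {
        try {
            OffsetDateTime.parse(raw);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Date string with surrounding white space, as it might appear
        // in pretty-printed XML.
        String raw = " \n 2018-12-12T02:36:56Z\t";
        System.out.println("parses untrimmed: " + parsesWithoutTrim(raw));
        System.out.println("parsed after trim: " + parseLastMod(raw));
    }
}
```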

@moviewang
Author

Hi @sebastian-nagel @kkrugler
Thanks a lot, I'm very grateful for your help. The problem may be caused by the URL containing Unicode white space, as @sebastian-nagel said. Please close this issue. Thanks!!!

@sebastian-nagel
Contributor

Hi @moviewang, I would suggest that the parser also trim Unicode white space. I'll open a PR to fix this. The intention of my inquiry was to confirm whether this is the reason for your problem. Thanks!

@sebastian-nagel changed the title from "sitemaps contains prefix whitespace log print Bad url" to "[Sitemaps] Trim Unicode whitespace around URLs" Dec 13, 2018
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Dec 13, 2018
@sebastian-nagel sebastian-nagel added this to the 0.11 milestone Dec 13, 2018
sebastian-nagel added a commit that referenced this issue Feb 20, 2019
…code-whitespace

[Sitemaps] Trim Unicode whitespace around URLs, fixes #224