Improve sitemap parsing #205

Merged


sebastian-nagel
Contributor

Note: this PR includes and is based on PR #200

@jnioche added this to the 0.10 milestone on Apr 12, 2018
Contributor

@jnioche left a comment

Thanks @sebastian-nagel, looks good.
The formatting changed after running `mvn java-formatter:format`, but this is not a big deal.

@sebastian-nagel
Contributor Author

Thanks! Applied code formatter to the lines changed/added by this PR.

- ignore query part of URL to determine sitemap location prefix
  for URL validation, fixes crawler-commons#202
- resolve relative links in RSS feeds, fixes crawler-commons#203
- allow non-continuous content (containing XML entities or CDATA)
  when parsing links in RSS feeds, fixes crawler-commons#204
- extract links from <guid> elements in RSS feeds, fixes crawler-commons#201
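
For readers following along, here is a minimal sketch of the first two items (ignoring the query part when deriving the sitemap location prefix, and resolving relative links against the feed URL). It is not the code merged in this PR: the class `SitemapUrlSketch` and its methods are hypothetical names chosen for illustration, the `example.com` URLs are made up, and only `java.net.URI` from the JDK is used.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not the crawler-commons implementation.
class SitemapUrlSketch {

    /** Returns scheme://authority/dir/ computed from the path alone, so a '/' in the query is ignored. */
    static String locationPrefix(String sitemapUrl) {
        URI uri = URI.create(sitemapUrl);
        String path = uri.getRawPath();          // path component only, without the query
        if (path == null || path.isEmpty()) {
            path = "/";                          // treat a missing path as the root directory
        }
        String dir = path.substring(0, path.lastIndexOf('/') + 1);
        return uri.getScheme() + "://" + uri.getRawAuthority() + dir;
    }

    /** True if the candidate URL lies under the sitemap's location prefix. */
    static boolean isUnderSitemap(String sitemapUrl, String candidateUrl) {
        return candidateUrl.startsWith(locationPrefix(sitemapUrl));
    }

    /** Resolves a possibly relative feed link against the URL of the feed itself. */
    static String resolveLink(String feedUrl, String link) {
        // trim() drops whitespace that often surrounds element text in feeds
        return URI.create(feedUrl).resolve(link.trim()).toString();
    }

    public static void main(String[] args) {
        String sitemap = "http://example.com/sitemaps/sitemap.xml?id=a/b";
        System.out.println(locationPrefix(sitemap));
        // prints http://example.com/sitemaps/
        System.out.println(isUnderSitemap(sitemap, "http://example.com/other/page.html"));
        // prints false
        System.out.println(resolveLink("http://example.com/feeds/rss.xml", " ../news/item1.html\n"));
        // prints http://example.com/news/item1.html
    }
}
```

Taking the prefix from the path component rather than from the last `/` of the whole URL string is what keeps a slash inside the query (`?id=a/b` above) from shifting the prefix.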
@sebastian-nagel merged commit fa76a59 into crawler-commons:master on Apr 25, 2018
@sebastian-nagel
Contributor Author

Thanks, @jnioche! Completed changelog, squashed commits and merged.

@sebastian-nagel deleted the cc-201-202-203-204 branch on April 25, 2018 11:04