[RSS sitemaps] Extract links from <guid> elements #201

Closed
sebastian-nagel opened this issue Apr 10, 2018 · 1 comment · Fixed by #205
@sebastian-nagel

Some RSS feeds put the item URL in a <guid> element rather than in <link>. At a minimum, when an item has no <link> element, the URL should also be extracted from <guid>.

One example item from https://antarcticsun.usap.gov/resources/xml/antsun-science.xml:

<item>
  <title>Why Antarctic Fish Don't Freeze Their Tails Off</title>
  <description>An innovative project to understand how fish survive in the frigid Antarctic waters is opening up new avenues for researchers monitoring what goes on under the sea ice in McMurdo Sound. Evolutionary biologist Paul Cziko from the University of Oregon is studying how Antarctic fish don’t, themselves, freeze into a solid block while spending their lives in subzero waters.</description>
  <enclosure url="https://antarcticsun.usap.gov/science/images7/rss_4349-moocamera-lg.jpg" length="65539" type="image/jpeg" />
  <category>The Biological World</category>
  <pubDate>Tue, 10 Apr 2018 09:38:10 GMT</pubDate>
  <guid isPermaLink="true">https://antarcticsun.usap.gov/science/contentHandler.cfm?id=4349</guid>
  <linktext target="_blank">Read the Story</linktext>
  <altimage>https://antarcticsun.usap.gov/science/images7/rss_4349-moocamera-sm.jpg</altimage>
  <imagecaption>Paul Cziko dives under the frozen ice to install the McMurdo Oceanographic Observatory</imagecaption>
</item>
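
For illustration, here is a minimal sketch of the proposed fallback, written against the JDK's DOM parser only. It is not the crawler-commons implementation: the class name `GuidLinkSketch` and the method `extractLink` are hypothetical. It assumes a `<guid>` should only be treated as a URL when `isPermaLink` is not set to "false" (RSS 2.0 defaults the attribute to true when it is absent).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of crawler-commons.
final class GuidLinkSketch {

    /** Returns the URL of a single RSS <item>, or null if none is found. */
    static String extractLink(String itemXml) {
        Element item;
        try {
            item = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(itemXml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }

        // Prefer <link> when it is present and non-empty.
        String link = firstText(item, "link");
        if (link != null) {
            return link;
        }

        // Otherwise fall back to <guid>, unless it is explicitly
        // marked as not being a permalink.
        NodeList guids = item.getElementsByTagName("guid");
        if (guids.getLength() == 0) {
            return null;
        }
        Element guid = (Element) guids.item(0);
        if ("false".equalsIgnoreCase(guid.getAttribute("isPermaLink"))) {
            return null;
        }
        String text = guid.getTextContent().trim();
        return text.isEmpty() ? null : text;
    }

    // Trimmed text of the first descendant with the given tag name, or null.
    private static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            return null;
        }
        String text = nodes.item(0).getTextContent().trim();
        return text.isEmpty() ? null : text;
    }

    public static void main(String[] args) {
        String item = "<item><title>Example</title>"
                + "<guid isPermaLink=\"true\">https://antarcticsun.usap.gov/science/contentHandler.cfm?id=4349</guid>"
                + "</item>";
        System.out.println(extractLink(item));
    }
}
```

Applied to the item above, which has no <link>, this would yield the <guid> URL. Whether a <guid> with isPermaLink="false" should still be tried as a URL is a design choice the sketch leaves out.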
@sebastian-nagel

Note: the RSS spec says that "all elements of an item are optional" while the Atom spec requires at least one element as child of <entry>.

sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Apr 11, 2018
- ignore query part of URL to determine sitemap location prefix
  for URL validation, fixes crawler-commons#202
- resolve relative links in RSS feeds, fixes crawler-commons#203
- allow non-continuous content (containing XML entities or CDATA)
  when parsing links in RSS feeds, fixes crawler-commons#204
- extract links from <guid> elements in RSS feeds, fixes crawler-commons#201
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Apr 25, 2018
- ignore query part of URL to determine sitemap location prefix
  for URL validation, fixes crawler-commons#202
- resolve relative links in RSS feeds, fixes crawler-commons#203
- allow non-continuous content (containing XML entities or CDATA)
  when parsing links in RSS feeds, fixes crawler-commons#204
- extract links from <guid> elements in RSS feeds, fixes crawler-commons#201