This repository was archived by the owner on Jun 4, 2020. It is now read-only.

Atom Feed type="xhtml" Parse Issue #120

@benubois

Feedzirra, sax-machine, or Nokogiri has a problem with Atom feeds whose entries use

<content type="xhtml">

As a result, all HTML markup is stripped from the content and only the bare text is left.

Sax-machine uses Nokogiri for XML parsing.

For example, when Hypercritical's feed is parsed using Nokogiri::XML(xml_string) you get:
https://gist.github.com/benubois/5520828

The children of <content> get parsed into a tree as well. I think what sax-machine needs to do at this point is recognize type="xhtml" and just return the element's contents as a string.

There is a two-year-old pull request for Feedzirra about this issue; however, the code no longer works.

From the Atom spec on how type="xhtml" should be handled:

3.1.1.3.  XHTML

Example atom:title with XHTML content:

...
<title type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:div>
    Less: <xhtml:em> &lt; </xhtml:em>
  </xhtml:div>
</title>
...

If the value of "type" is "xhtml", the content of the Text construct
MUST be a single XHTML div element [XHTML] and SHOULD be suitable for
handling as XHTML.  The XHTML div element itself MUST NOT be
considered part of the content.  Atom Processors that display the
content MAY use the markup to aid in displaying it.  The escaped
versions of characters such as "&" and ">" represent those
characters, not markup.
