Atom Feed type="xhtml" Parse Issue

Feedzirra, sax-machine, or Nokogiri has an issue with Atom feeds that use:

``` xml
<content type="xhtml">
```

The result is that all HTML markup is stripped out of the parsed content.

Sax-machine uses Nokogiri for XML parsing.

For example, when Hypercritical's feed is parsed using `Nokogiri::XML(xml_string)`, you get:
https://gist.github.com/benubois/5520828

The `<content>` element gets parsed into a tree along with the rest of the feed. I think what sax-machine needs to do at this point is recognize `type="xhtml"` and return the element's contents as a string.

There is a [two-year-old pull request for Feedzirra](https://github.com/pauldix/feedzirra/pull/58) about this issue, but its code no longer works.

From the Atom spec on how `type="xhtml"` should be handled:

```
3.1.1.3.  XHTML

Example atom:title with XHTML content:

...
<title type="xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:div>
    Less: <xhtml:em> &lt; </xhtml:em>
  </xhtml:div>
</title>
...

If the value of "type" is "xhtml", the content of the Text construct
MUST be a single XHTML div element [XHTML] and SHOULD be suitable for
handling as XHTML.  The XHTML div element itself MUST NOT be
considered part of the content.  Atom Processors that display the
content MAY use the markup to aid in displaying it.  The escaped
versions of characters such as "&" and ">" represent those
characters, not markup.
```


Atom Feed type="xhtml" Parse Issue #120
