epub HTML body contains a body tag #124

Closed
tpikonen opened this issue Mar 16, 2021 · 1 comment · Fixed by #125
Comments

@tpikonen

Environment

  • Operating System: Debian bullseye
  • node --version: v12.21.0 (from Debian)
  • npm --version: 7.5.2 (from Debian)
  • percollate --version: 1.2.2

Description

I tested percollate with percollate epub https://www.newyorker.com/culture/annals-of-inquiry/slate-star-codex-and-silicon-valleys-war-against-the-media. The text content in the generated epub is in an .xhtml file that has a body tag nested inside another body tag, like this:

        <body>
                <header class="article__header">
                        <h1 class="article__title">Slate Star Codex and Silicon Valley’s War Against the Media</h1>
                        
                        <p class="article__byline">By <span>Gideon Lewis-Kraus</span></p>
                        
                        <p class="article__url">
                                Source:
                                <a class="no-href" href="https://www.newyorker.com/culture/annals-of-inquiry/slate-star-codex-and-silicon-valleys-war-against-the-media">https://www.newyorker.com/culture/annals-of-inquiry/slate-star-codex-and-silicon-valleys-war-against-the-media</a>
                        </p>
                </header>
                <div class="article__content"><body xmlns="http://www.w3.org/1999/xhtml">...

epubcheck reports this as an error.
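
For a quick check without running epubcheck, the nesting can be spotted by counting opening body tags in the generated .xhtml. This is only a rough sketch: the helper names are hypothetical, it works on the raw string rather than a parsed document (so a body tag inside a comment or CDATA section would also be counted), and it is no substitute for epubcheck.

```javascript
// Hypothetical helper: counts opening <body> tags in a markup string.
function countBodyOpenTags(markup) {
  // Require whitespace, ">" or "/" after "<body" so that a tag such as
  // "<bodyguard>" is not counted. Closing tags ("</body>") never match.
  const matches = markup.match(/<body(?=[\s>\/])/gi);
  return matches ? matches.length : 0;
}

// Hypothetical helper: true when more than one opening <body> tag is present.
function hasNestedBody(markup) {
  return countBodyOpenTags(markup) > 1;
}

// Shortened stand-in for the markup quoted in the report.
const sample =
  '<body><div class="article__content">' +
  '<body xmlns="http://www.w3.org/1999/xhtml"><p>text</p></body>' +
  '</div></body>';

console.log(hasNestedBody(sample));
```

A well-formed chapter file should contain exactly one opening body tag, so any count above one points at the problem described here.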

Also, the epub contains several image files, but the HTML has no img tags, so they are not shown.

@danburzo
Owner

Thanks for the report! In the PR above I fix the two issues you raised:

  1. Re: the images — initially we were fetching a list of remote resources (images, etc.) to bundle with the EPUB before running Readability on the content, so we ended up bundling images that didn't make the cut.
  2. Re: the <body> element, this was an aspect I missed about the HTML sanitizer (DOMPurify) when I made a series of "optimizations" :-P. (The error was not caught by the automated tests because it's non-fatal.)
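
The two points above can be illustrated with a small sketch. It does not reproduce the changes in the PR: the helper names are hypothetical, plain string handling stands in for Readability and DOMPurify, and the sample markup is invented. It only shows the general idea of dropping a wrapping body element from the sanitized content and collecting images from the extracted content rather than from the full page.

```javascript
// Hypothetical helper: if the markup is wrapped in a single <body>...</body>,
// return only what is inside it; otherwise return the markup unchanged.
function unwrapBody(markup) {
  const match = markup.match(/^\s*<body(?:\s[^>]*)?>([\s\S]*)<\/body>\s*$/i);
  return match ? match[1] : markup;
}

// Hypothetical helper: list the src values of <img> tags found in the markup.
function collectImageSources(markup) {
  const sources = [];
  const pattern = /<img\b[^>]*?\ssrc="([^"]*)"/gi;
  let match;
  while ((match = pattern.exec(markup)) !== null) {
    sources.push(match[1]);
  }
  return sources;
}

// Invented sample: the full page has an image that the extraction step drops.
const fullPage =
  '<body><img src="ad.png"/><article><p>Hi</p><img src="photo.jpg"/></article></body>';
// Stand-in for the extracted and sanitized article content (assumption).
const extracted =
  '<body xmlns="http://www.w3.org/1999/xhtml"><p>Hi</p><img src="photo.jpg"/></body>';

const content = unwrapBody(extracted);
const toBundle = collectImageSources(content);

console.log(content);
console.log(toBundle);
console.log(collectImageSources(fullPage));
```

Collecting from the full page would also pick up ad.png, which never appears in the final content; collecting after extraction keeps the bundle limited to images the reader will actually see.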
