Preserve html entities and multiple spaces #59

Open. brondsem wants to merge 3 commits.


@brondsem
Contributor
brondsem commented Nov 9, 2012

A few commits to preserve HTML entities and multiple spaces, and to fix the general escaping that occurs within backticks. More details are in the commit messages.

brondsem added some commits Nov 9, 2012
@brondsem brondsem escape &<> so that entities don't disappear during conversion b76cbe3
@brondsem brondsem set code flag properly so that escaping is not done within `backticks` 08e0168
@brondsem brondsem preserve &nbsp; entities
This allows multiple sequential &nbsp; entities to still be
multiple spaces, rather than getting collapsed.

Within `code` blocks, neither a literal space nor a &nbsp; work,
so a unicode nbsp char is used which seems to work in many markdown
renderers.  This fixes the output of the google doc code section.
d7c33ed
@brondsem
Contributor

Hey, just checking on this. Wondering if this is merge-able or if anything should be changed?

@aaronsw
Owner
aaronsw commented Dec 13, 2012

Sorry, somehow this got lost in the shuffle. I don't think most users of a program like html2text want HTML in their output, so I'm not comfortable merging a patch that will cause HTML to appear in the output by default.

What's your motivation here?

@brondsem
Contributor

In the first commit, HTML entities are used so that if your source HTML content is about HTML tags and entities, they will stay escaped and not "devolve" to actual tags and entities. For example, &amp;copy; or &lt;b&gt;foo&lt;/b&gt; will no longer turn into &copy; and <b>foo</b> (which render very differently from how the original HTML renders).
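To illustrate the idea (this is a minimal sketch using only Python's standard `html` module, not the code from the commit; the helper name `escape_for_markdown` is hypothetical):

```python
import html

def escape_for_markdown(text: str) -> str:
    """Re-escape &, < and > in already-decoded text.

    quote=False leaves quotation marks alone, so only the three
    characters that could form tags or entities are touched.
    """
    return html.escape(text, quote=False)

# HTML source that talks *about* markup, so the markup itself is escaped.
source = "&amp;copy; or &lt;b&gt;foo&lt;/b&gt;"

# Decoding once gives the text a reader of the original page sees.
decoded = html.unescape(source)

# Writing `decoded` straight into Markdown would let a renderer treat it
# as a real entity and a real tag. Re-escaping keeps it literal.
preserved = escape_for_markdown(decoded)

print(decoded)
print(preserved)
```

Here `decoded` is the literal text `&copy; or <b>foo</b>`, and `preserved` round-trips back to the escaped form, so a Markdown renderer shows the same characters the original HTML did.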

The second commit doesn't add HTML to the markdown output.

The third commit preserves &nbsp; from the HTML into the markdown. This is illustrated in the GoogleDocMassDownload files, in which there were already two spaces between "human" and "being". Previously, that was getting collapsed into one space. Now it'll preserve the two spaces. The downside to this is illustrated in "nbsp.md", in which the &nbsp; entities from the HTML are carried through to the markdown unnecessarily. They could be regular spaces and everything would render consistently with the original HTML. Perhaps this should go under the "escape snob" flag.
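A rough sketch of the trade-off, assuming the decoded text holds non-breaking spaces as U+00A0 (the function `normalize_spaces` and its `keep_nbsp` parameter are hypothetical, not html2text options):

```python
import re

NBSP = "\u00a0"  # what &nbsp; decodes to

def normalize_spaces(text: str, keep_nbsp: bool = True) -> str:
    """Collapse whitespace runs, optionally preserving non-breaking spaces."""
    if not keep_nbsp:
        # Treat NBSP like any other whitespace: everything collapses to one space.
        return re.sub(r"[ \t\r\n\f\u00a0]+", " ", text)
    # Collapse only ordinary whitespace, then write each NBSP back out as an
    # entity so a Markdown renderer cannot merge it with its neighbours.
    collapsed = re.sub(r"[ \t\r\n\f]+", " ", text)
    return collapsed.replace(NBSP, "&nbsp;")

# Hypothetical input: two visible spaces, one of them non-breaking.
sample = "human" + NBSP + " being"

print(normalize_spaces(sample))
print(normalize_spaces(sample, keep_nbsp=False))
```

With `keep_nbsp=True` the two-space gap survives as `human&nbsp; being`; with `keep_nbsp=False` it collapses to `human being`, which is the cleaner output when a single non-breaking space was only there for layout.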

My overall rationale for this is that we're importing a large amount of content into a markdown-based system, so we want to maintain accuracy to the original content. Specifically, we're using this within SourceForge as we upgrade projects from our legacy platform to our new platform. A lot of SourceForge forum and ticket content is technical, so there are literal HTML entities we need to preserve, as well as code snippets that have lines indented with many spaces (consecutive &nbsp; entities).

Thanks

@theSage21 theSage21 referenced this pull request in Alir3z4/html2text Jul 11, 2016
Open

unexpanded &lt; &gt; &amp; #109
