Preserve html entities and multiple spaces #59

wants to merge 3 commits into



brondsem commented Nov 9, 2012

A few commits to address preservation of HTML entities and multiple spaces, and to fix general escaping that occurs within backticks. More details are in the commit messages.

brondsem added some commits Nov 9, 2012

@brondsem brondsem escape &<> so that entities don't disappear during conversion b76cbe3
@brondsem brondsem set code flag properly so that escaping is not done within `backticks` 08e0168
@brondsem brondsem preserve &nbsp; entities
This allows multiple sequential &nbsp; entities to still be
multiple spaces, rather than getting collapsed.

Within `code` blocks, neither a literal space nor a &nbsp; works,
so a unicode nbsp char is used which seems to work in many markdown
renderers.  This fixes the output of the google doc code section.

brondsem commented Dec 13, 2012

Hey, just checking on this. Wondering if this is merge-able or if anything should be changed?


aaronsw commented Dec 13, 2012

Sorry, somehow this got lost in the shuffle. I don't think most users of a program like html2text want HTML in their output, so I'm not comfortable merging a patch that will cause HTML to appear in the output by default.

What's your motivation here?


brondsem commented Dec 13, 2012

In the first commit, HTML entities are used so that if your source HTML content is about HTML tags and entities, they will stay escaped and not "devolve" into actual tags and entities. For example, &amp;copy; or &lt;b&gt;foo&lt;/b&gt; will no longer turn into &copy; and <b>foo</b> (which render very differently from the original HTML).
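
To illustrate the problem the first commit addresses, here is a minimal sketch (not the actual patch). It assumes the converter receives text with entities already decoded by the HTML parser, so writing that text out verbatim loses the escaping. The helper names `decoded_text` and `escape_for_markdown` are hypothetical; only the standard-library `html` module is used.

```python
import html


def decoded_text(source: str) -> str:
    """Simulate what an HTML parser hands to a converter: entities decoded."""
    return html.unescape(source)


def escape_for_markdown(text: str) -> str:
    """Re-escape &, <, and > so they render as literal characters (hypothetical helper)."""
    return html.escape(text, quote=False)


source = "&amp;copy; or &lt;b&gt;foo&lt;/b&gt;"
naive = decoded_text(source)          # escaping lost: "&copy; or <b>foo</b>"
escaped = escape_for_markdown(naive)  # escaping restored, identical to source

print(naive)
print(escaped)
```

Written out as-is, the `naive` string would render as a copyright sign and bold text; the `escaped` string renders as the literal characters the original page showed.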

The second commit doesn't add HTML to the markdown output.

The third commit preserves &nbsp; from the HTML in the markdown. This is illustrated in the GoogleDocMassDownload files, in which there were already two spaces between "human" and "being". Previously, they were collapsed into one space; now both are preserved. The downside is illustrated in "nbsp.md", in which the &nbsp; entities from the HTML are carried through to the markdown unnecessarily. They could be regular spaces and everything would still render consistently with the original HTML. Perhaps this should go under the "escape snob" flag.
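
A rough sketch of the behaviour described for the third commit, under the assumption that the parser has already decoded each &nbsp; to the Unicode character U+00A0. The helper `preserve_nbsp` and its `in_code` parameter are hypothetical and are not html2text's API.

```python
NBSP = "\u00a0"  # what a parser yields after decoding the nbsp entity


def preserve_nbsp(text: str, in_code: bool = False) -> str:
    """Keep non-breaking spaces from collapsing (hypothetical helper)."""
    if in_code:
        # Inside backticks an entity would be shown literally, so the
        # Unicode character itself is kept, as the commit message describes.
        return text
    # In ordinary text, write the entity back out so a Markdown renderer
    # does not collapse the run of spaces into one.
    return text.replace(NBSP, "&nbsp;")


parsed = "human" + NBSP + " being"
print(preserve_nbsp(parsed))
print(repr(preserve_nbsp("x" + NBSP * 4 + "= 1", in_code=True)))
```

Gating the non-code branch behind an option would address the downside noted above, where a single &nbsp; is carried through even though a regular space would render the same.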

My overall rationale for this is that we're importing a large amount of content into a markdown-based system, so we want to maintain accuracy to the original content. Specifically, we're using this within SourceForge as we upgrade projects from our legacy platform to our new platform. A lot of SourceForge forum and ticket content is technical, so there are literal HTML entities we need to preserve, as well as code snippets that have lines indented with many spaces (consecutive &nbsp; entities).


@pombredanne pombredanne pushed a commit to pombredanne/html2text that referenced this pull request Oct 10, 2015

@Alir3z4 Alir3z4 Merge pull request #59 from smblackburn/master
Support for image sizing using raw html

Thanks @smblackburn

@theSage21 theSage21 referenced this pull request in Alir3z4/html2text Jul 11, 2016


unexpanded &lt; &gt; &amp; #109
