Preserve html entities and multiple spaces #59

Open: 3 commits to merge, 2 participants

@brondsem

A few commits to address preservation of html entities and multiple spaces, and fix general escaping that occurs with `backticks`. More details in commit messages

brondsem added some commits
brondsem escape &<> so that entities don't disappear during conversion b76cbe3
brondsem set code flag properly so that escaping is not done within `backticks` 08e0168
brondsem preserve &nbsp; entities
This allows multiple sequential &nbsp; entities to still be
multiple spaces, rather than getting collapsed.

Within `code` blocks, neither a literal space nor a &nbsp; work,
so a unicode nbsp char is used which seems to work in many markdown
renderers.  This fixes the output of the google doc code section.
d7c33ed
@brondsem

Hey, just checking on this. Wondering if this is merge-able or if anything should be changed?

@aaronsw
Owner

Sorry, somehow this got lost in the shuffle. I don't think most users of a program like html2text want HTML in their output, so I'm not comfortable merging a patch that will cause HTML to appear in the output by default.

What's your motivation here?

@brondsem

In the first commit, HTML entities are used so that if your source HTML content is about HTML tags and entities, they will stay escaped and not "devolve" into actual tags and entities. For example, &amp;copy; or &lt;b&gt;foo&lt;/b&gt; will no longer turn into &copy; and <b>foo</b> (which render very differently from what the original HTML renders as).
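
To make that failure mode concrete, here is a minimal sketch using only Python's standard `html` module. It is not html2text's code, and the two helper names are hypothetical. It assumes the converter decodes each entity once and contrasts writing the decoded text unchanged with re-escaping `&`, `<` and `>` before writing it.

```python
import html

def naive_convert(source_text):
    """Hypothetical helper: decode entities once and emit the result as-is."""
    return html.unescape(source_text)

def escaping_convert(source_text):
    """Hypothetical helper: decode entities, then re-escape &, < and >."""
    return html.escape(html.unescape(source_text), quote=False)

source = "&amp;copy; or &lt;b&gt;foo&lt;/b&gt;"

# Without re-escaping, the output contains a live entity and a live tag,
# so a Markdown renderer would show a copyright sign and bold text.
naive = naive_convert(source)

# With re-escaping, the output renders the same way the source HTML did.
escaped = escaping_convert(source)

print(naive)
print(escaped)
```

In this sketch the escaped output is identical to the source fragment, which is the "stay escaped" behaviour described above.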

The second commit doesn't add HTML to the markdown output.

The third commit preserves &nbsp; from the HTML into the markdown. This is illustrated in the GoogleDocMassDownload files, in which there were already two spaces between "human" and "being". Previously, those were getting collapsed into one space. Now the two spaces are preserved. The downside is illustrated in "nbsp.md", in which the &nbsp; entities from the HTML are carried through to the markdown unnecessarily. They could be regular spaces and everything would render consistently with the original HTML. Perhaps this should go under the "escape snob" flag.
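
The rendering difference can be approximated with a small sketch. The `rendered_text` function is a hypothetical stand-in for what a browser displays, not part of html2text or of any Markdown renderer: it collapses runs of ordinary whitespace and then decodes entities.

```python
import html
import re

def rendered_text(markdown_fragment):
    """Hypothetical approximation of displayed text for an inline fragment."""
    # Browsers collapse runs of ordinary spaces, tabs and newlines.
    # A no-break space is not part of that set, so it survives.
    collapsed = re.sub(r"[ \t\n]+", " ", markdown_fragment)
    return html.unescape(collapsed)

plain = rendered_text("human  being")       # two ordinary spaces
kept = rendered_text("human &nbsp;being")   # ordinary space plus &nbsp;

print(repr(plain))
print(repr(kept))
```

Under these assumptions the first fragment displays with a single gap, while the second keeps a two-character gap because the no-break space is not collapsed.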

My overall rationale for this is that we're importing a large amount of content into a markdown-based system, so we want to maintain accuracy to the original content. Specifically, we're using this within SourceForge as we upgrade projects from our legacy platform to our new platform. Much of SourceForge's forum and ticket content is technical, so there are literal HTML entities we need to preserve, as well as code snippets that have lines indented with many spaces (consecutive &nbsp; entities).

Thanks

Commits on Nov 9, 2012
22 html2text.py
@@ -30,7 +30,7 @@ def has_key(x, y):
import urllib.request as urllib
except:
import urllib
-import optparse, re, sys, codecs, types
+import optparse, re, sys, codecs, types, cgi
try: from textwrap import wrap
except: pass
@@ -266,16 +266,25 @@ def close(self):
if self.unicode_snob:
nbsp = unichr(name2cp('nbsp'))
else:
- nbsp = u' '
+ nbsp = u'&nbsp;'
self.outtext = self.outtext.replace(u'&nbsp_place_holder;', nbsp)
return self.outtext
def handle_charref(self, c):
- self.o(self.charref(c), 1)
+ charref = self.charref(c)
+ if not self.code and not self.pre:
+ charref = cgi.escape(charref)
+ self.o(charref, 1)
def handle_entityref(self, c):
- self.o(self.entityref(c), 1)
+ entityref = self.entityref(c)
+ if not self.code and not self.pre and entityref != '&nbsp_place_holder;':
+ entityref = cgi.escape(entityref)
+ if (self.code or self.pre) and entityref == '&nbsp_place_holder;':
+ # &nbsp; doesn't work in `` and indented blocks
+ entityref = unichr(name2cp('nbsp'))
+ self.o(entityref, 1)
def handle_starttag(self, tag, attrs):
self.handle_tag(tag, attrs, 1)
@@ -453,7 +462,10 @@ def handle_tag(self, tag, attrs, start):
# handle some font attributes, but leave headers clean
self.handle_emphasis(start, tag_style, parent_style)
- if tag in ["code", "tt"] and not self.pre: self.o('`') #TODO: `` `this` ``
+ if tag in ["code", "tt"] and not self.pre:
+ # TODO: `` `this` ``
+ self.o('`')
+ self.code = not self.code
if tag == "abbr":
if start:
self.abbr_title = None
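
The branching added to `handle_entityref` in this file can be read as a small decision function. The following is a sketch in Python 3, not the patch itself: `emit_entity` and its `in_code_or_pre` parameter are hypothetical names, `html.escape(..., quote=False)` stands in for the `cgi.escape` call used in the diff (the `cgi` function no longer exists in current Python 3 releases), and a `"\u00a0"` literal stands in for `unichr(name2cp('nbsp'))`.

```python
import html

# Same placeholder string that appears in the diff.
NBSP_PLACEHOLDER = "&nbsp_place_holder;"

def emit_entity(decoded, in_code_or_pre):
    """Return the text to output for an already-decoded entity reference."""
    if decoded == NBSP_PLACEHOLDER:
        # Inside backticks or indented blocks the entity would appear
        # literally, so use the Unicode no-break space there. Elsewhere
        # keep the placeholder, which close() replaces at the end.
        return "\u00a0" if in_code_or_pre else decoded
    if in_code_or_pre:
        # Code spans and pre blocks are literal: no escaping.
        return decoded
    # Normal text: re-escape &, < and > so they are not read as markup.
    return html.escape(decoded, quote=False)

print(emit_entity("<", in_code_or_pre=False))
print(emit_entity("<", in_code_or_pre=True))
```

The `self.code` toggle added in `handle_tag` is what lets the real method tell these cases apart for inline code.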
10 test/GoogleDocMassDownload.md
@@ -13,16 +13,16 @@ text to separate lists
1. now with numbers
2. the prisoner
1. not an _italic number_
- 2. a **bold human** being
+ 2. a **bold human** &nbsp;being
3. end
**bold**
_italic_
` def func(x):`
-` if x < 1:`
-` return 'a'`
-` return 'b'`
+`   if x < 1:`
+`     return 'a'`
+`   return 'b'`
-Some ` fixed width text` here
+Some ` fixed width text` &nbsp;here
_` italic fixed width text`_
10 test/GoogleDocSaved.md
@@ -13,16 +13,16 @@ text to separate lists
1. now with numbers
2. the prisoner
1. not an _italic number_
- 2. a **bold human** being
+ 2. a **bold human** &nbsp;being
3. end
**bold**
_italic_
` def func(x):`
-` if x < 1:`
-` return 'a'`
-` return 'b'`
+`   if x < 1:`
+`     return 'a'`
+`   return 'b'`
-Some ` fixed width text` here
+Some ` fixed width text` &nbsp;here
_` italic fixed width text`_
3  test/nbsp.html
@@ -5,7 +5,7 @@
<body>
<h1>NBSP handling test #1</h1>
- <p>In this test all NBSPs will be replaced with ordinary spaces (unicode_snob = False).</p>
+ <p>In this test all NBSP entities will be preserved (unicode_snob = False).</p>
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do&nbsp;eiusmod
tempor incididunt ut&nbsp;labore et&nbsp;dolore magna aliqua. Ut&nbsp;enim ad&nbsp;minim veniam,
@@ -17,4 +17,3 @@
proident, sunt in&nbsp;culpa qui officia deserunt mollit anim id&nbsp;est laborum.</p>
</body>
</html>
-
18 test/nbsp.md
@@ -1,14 +1,14 @@
# NBSP handling test #1
-In this test all NBSPs will be replaced with ordinary spaces (unicode_snob =
-False).
+In this test all NBSP entities will be preserved (unicode_snob = False).
-Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
-tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
-quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
-consequat.
+Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do&nbsp;eiusmod
+tempor incididunt ut&nbsp;labore et&nbsp;dolore magna aliqua. Ut&nbsp;enim
+ad&nbsp;minim veniam, quis nostrud exercitation ullamco laboris nisi
+ut&nbsp;aliquip ex&nbsp;ea commodo consequat.
-Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
-eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
-in culpa qui officia deserunt mollit anim id est laborum.
+Duis aute irure dolor in&nbsp;reprehenderit in&nbsp;voluptate velit esse
+cillum dolore eu&nbsp;fugiat nulla pariatur. Excepteur sint occaecat cupidatat
+non proident, sunt in&nbsp;culpa qui officia deserunt mollit anim id&nbsp;est
+laborum.
16 test/normal.html
@@ -136,5 +136,21 @@
<p>
c:\tmp, \\server\path, \_/, foo\bar, #\#, \\#
</p>
+
+ <code>c:\tmp, \\server\path, \_/, foo\bar, #\#, \\#</code>
+
+ <p>
+ A common entity is &amp;copy;<br>
+ 3 &lt; 6 &amp;&amp; "z" &#62; "a&quot;
+ </p>
+
+ <p>
+ foo&nbsp;&nbsp;&nbsp;bar
+ </p>
+
+ <pre>foo&nbsp;&nbsp;&nbsp;bar</pre>
+
+ <code>foo&nbsp;&nbsp;&nbsp;bar</code>
+
</body>
</html>
12 test/normal.md
@@ -52,3 +52,15 @@ not a hr
c:\tmp, \\\server\path, \\_/, foo\bar, #\\#, \\\\#
+`c:\tmp, \\server\path, \_/, foo\bar, #\#, \\#`
+
+A common entity is &amp;copy;
+3 &lt; 6 &amp;&amp; "z" &gt; "a"
+
+foo&nbsp;&nbsp;&nbsp;bar
+
+
+ foo   bar
+
+`foo   bar`
+
7 test/normal_escape_snob.html
@@ -133,9 +133,14 @@
<br>
- - -
</p>
-
+
<p>
c:\tmp, \\server\path, \_/, foo\bar, #\#, \\#
</p>
+
+ <p>
+ A common entity is &amp;copy;<br>
+ 3 &lt; 6 &amp;&amp; "z" &#62; "a&quot;
+ </p>
</body>
</html>
3  test/normal_escape_snob.md
@@ -53,3 +53,6 @@ not a hr
c:\tmp, \\\server\path, \\\_/, foo\bar, \#\\\#, \\\\\#
+A common entity is &amp;copy;
+3 &lt; 6 &amp;&amp; "z" &gt; "a"
+
2  test/run_tests.py
@@ -43,7 +43,7 @@ def test_command(fn, *args):
cmd += [fn]
result = get_baseline(fn)
- actual = subprocess.Popen(cmd, stdout=subprocess.PIPE).stdout.read()
+ actual = subprocess.Popen(cmd, stdout=subprocess.PIPE).stdout.read().decode('utf-8')
if os.name == 'nt':
# Fix the unwanted CR to CRCRLF replacement
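
The `.decode('utf-8')` added above matters on Python 3, where a pipe opened without a text mode yields `bytes` rather than `str`. The sketch below is a standalone illustration, not the test runner's code: the child command is made up for the example, and it uses `communicate()` instead of `stdout.read()` so that the child is also waited on.

```python
import subprocess
import sys

# Hypothetical child process that writes UTF-8 bytes containing a
# no-break space, similar to the converter's output inside code blocks.
child_code = "import sys; sys.stdout.buffer.write('foo\\u00a0bar'.encode('utf-8'))"
cmd = [sys.executable, "-c", child_code]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
raw, _ = proc.communicate()

# raw is bytes; comparing it with a str would always be unequal.
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)
```

Decoding explicitly as UTF-8, rather than relying on the platform default, keeps non-ASCII characters such as the no-break space intact regardless of locale.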