HTMl enities get out of anchor text #31

Closed
Alir3z4 opened this issue Oct 4, 2014 · 11 comments · Fixed by #77
@Alir3z4
Owner

Alir3z4 commented Oct 4, 2014

@szepeviktor:

another strange behaviour

<a href="http://thth" class="nolink" style="text-decoration:none;color:inherit;cursor:default;">&#225;ll&#225;s: Country Manager</a>
a[llas: Country Manager](http://thth)
@Alir3z4 Alir3z4 added the bug label Oct 4, 2014
@Alir3z4
Owner Author

Alir3z4 commented Oct 4, 2014

szepeviktor:

I expected [állás: Country Manager](http://thth)
Are accents converted by default?

@Alir3z4
Owner Author

Alir3z4 commented Oct 4, 2014

@szepeviktor it's because html2text likes to see the world as ASCII only and seems to think there is only English with its 26 letters, while showing mercy to a few other characters.

Well, this needs to be fixed; text should be UTF-8 and encoded too.

In [23]: html2text.html2text(u'<a href="go.com" class="nolink">állás: Country Manager]</a>')
Out[23]: u'[\xe1ll\xe1s: Country Manager]](go.com)\n\n'

or just simply:

In [37]: print html2text.html2text(u'<a href="go.com" class="nolink">állás: Country Manager]</a>')
[állás: Country Manager]](go.com)

I guess we need to encode the text before feeding it to html2text, right?
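
Note that the two outputs above are the same string: `Out[23]` is the Python 2 repr, which escapes non-ASCII characters, while `print` shows the characters themselves. A minimal sketch of the same distinction in Python 3, using only built-ins (the sample string is copied from the session above; `escaped` is just an illustrative name):

```python
# The converted text from the session above, as a Python 3 str.
text = "[állás: Country Manager]](go.com)\n\n"

# ascii() escapes non-ASCII characters, much like the Python 2 repr did.
escaped = ascii(text)
print(escaped)

# print() writes the characters themselves.
print(text)
```

The `\xe1` in the escaped form is only a display of the code point U+00E1 (á); the string content is identical in both cases.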

@Alir3z4
Owner Author

Alir3z4 commented Oct 4, 2014

szepeviktor:

Yes. Only a one-byte HTML entity will get outside the anchor, not a UTF-8 character.

@Alir3z4
Owner Author

Alir3z4 commented Oct 4, 2014

I guess html2text should handle this itself, i.e. encode the input to UTF-8.

Feel free to patch it and make that the default.

@Alir3z4 Alir3z4 changed the title from "HTMl enities get out of anchor tex" to "HTMl enities get out of anchor text" on Oct 24, 2014
theSage21 added a commit to theSage21/html2text that referenced this issue Jun 14, 2015
Changed call order of handle_charref and handle_entityref.

The error occurred when the first element of the link text was a charref: handle_charref was
called first instead of handle_data, which is where the '[' is inserted.
@theSage21
Collaborator

This only happens when the link text begins with a character reference.
<a href="http://thth">&#225;ll&#225;s: Country Manager</a> causes the bug.
<a href="http://thth"> &#225;ll&#225;s: Country Manager</a> translates correctly to
[ állás: Country Manager](http://thth)

In the failing case the first callback is handle_charref and only afterwards handle_data, and handle_data is the function that adds the '['.
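
The callback order described here can be observed with the standard library parser alone. The sketch below assumes that html2text's parser behaves like Python 3's `html.parser.HTMLParser` with `convert_charrefs=False`; `CallbackLogger` and `log_callbacks` are hypothetical names used only for illustration and are not part of html2text.

```python
from html.parser import HTMLParser


class CallbackLogger(HTMLParser):
    """Records which parser callbacks fire, and in what order."""

    def __init__(self):
        # convert_charrefs=False keeps &#225; as a separate handle_charref call
        # instead of folding it into handle_data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def log_callbacks(html):
    """Parse html and return the list of (callback, argument) pairs."""
    parser = CallbackLogger()
    parser.feed(html)
    parser.close()
    return parser.events


failing = log_callbacks('<a href="http://thth">&#225;ll&#225;s: Country Manager</a>')
working = log_callbacks('<a href="http://thth"> &#225;ll&#225;s: Country Manager</a>')
print(failing[1])  # first callback after the start tag: ('charref', '225')
print(working[1])  # first callback after the start tag: ('data', ' ')
```

In the failing input the first callback after the start tag is a charref, so any logic that lives only in `handle_data` has not run yet when the first character is written.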

Fixed in #77
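
The actual change is in #77 and is not reproduced here. As a rough illustration of the general idea only, one way to avoid this class of bug is to send text from both callbacks through a single output path that opens the bracket on first use. Everything in this sketch (`ToyLinkConverter`, `emit_text`, `convert`) is hypothetical and deliberately handles nothing but `<a>` tags; it is not html2text's code.

```python
from html.parser import HTMLParser


class ToyLinkConverter(HTMLParser):
    """Toy HTML-to-Markdown converter for <a> tags only (not html2text)."""

    def __init__(self):
        # convert_charrefs=False so that handle_charref is actually called.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.href = None
        self.bracket_pending = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href", "")
            # The '[' is emitted lazily, together with the first piece of text.
            self.bracket_pending = True

    def emit_text(self, text):
        # Single output path: whichever callback fires first opens the bracket.
        if self.bracket_pending:
            self.out.append("[")
            self.bracket_pending = False
        self.out.append(text)

    def handle_data(self, data):
        self.emit_text(data)

    def handle_charref(self, name):
        # name is "225" for &#225; or "xe1" for &#xe1;
        code = int(name[1:], 16) if name[0] in "xX" else int(name)
        self.emit_text(chr(code))

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            if self.bracket_pending:  # link with no text at all
                self.out.append("[")
                self.bracket_pending = False
            self.out.append("](%s)" % self.href)
            self.href = None


def convert(html):
    """Convert a fragment containing <a> tags to Markdown-style links."""
    parser = ToyLinkConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(convert('<a href="http://thth">&#225;ll&#225;s: Country Manager</a>'))
```

Because `handle_charref` and `handle_data` both go through `emit_text`, the opening bracket lands before the first character regardless of which callback the parser invokes first.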

@theSage21
Collaborator

The issue is solved in #77. Can we close this?

@szepeviktor
Contributor

Please wait till I get home, and confirm.

@theSage21
Collaborator

@szepeviktor sure sure. it is morning again and I have to sleep. See you on the other side of the sun. 😄

@szepeviktor
Contributor

There is a problem:

$ echo $LANG
en_US.UTF-8
$ echo '<a href="http://thth" class="nolink" style="text-decoration:none;color:inherit;cursor:default;">&#225;ll&#225;s: Country Manager</a>'| ./html2text
[allas: Country Manager](http://thth)

Shouldn't &#225; be á?

@theSage21
Collaborator

@szepeviktor By default, the command line works with ASCII, so these characters are converted to their ASCII equivalents. As of now there is no command-line option for Unicode. Try this:

>>> import html2text as h2t
>>> H2T = h2t.HTML2Text()
>>> html = '<a href="http://thth" class="nolink" style="text-decoration:none;color:inherit;cursor:default;">&#225;ll&#225;s: Country Manager</a>'
>>> md_ascii = H2T.handle(html)
>>> H2T.unicode_snob = True
>>> md_unicode = H2T.handle(html)
>>> print(md_ascii)
[allas: Country Manager](http://thth)


>>> print(md_unicode)
[állás: Country Manager](http://thth)


>>>
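
For comparison, the two forms of the link text can be reproduced with the standard library alone. This is only a sketch of the difference between the outputs: it does not show how html2text performs its ASCII conversion, and the NFKD-based fallback below is an assumed stand-in technique. `decode_reference` and `ascii_fallback` are hypothetical helper names.

```python
import html
import unicodedata


def decode_reference(text):
    """Resolve character references such as &#225; into real characters."""
    return html.unescape(text)


def ascii_fallback(text):
    """Approximate text in ASCII by stripping combining marks (assumed technique)."""
    # NFKD splits 'á' into 'a' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything outside ASCII removes the combining accent.
    return decomposed.encode("ascii", "ignore").decode("ascii")


link_text = decode_reference("&#225;ll&#225;s: Country Manager")
print(link_text)                  # állás: Country Manager
print(ascii_fallback(link_text))  # allas: Country Manager
```

So `&#225;` does decode to á; the plain `a` in the default output comes from a later ASCII conversion step, which `unicode_snob = True` switches off in the session above.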

@szepeviktor
Contributor

Thank you.
Please merge #77.


3 participants