HTML entities like &nbsp; inside block causing 'Failed to parse formatted text' #969

Closed
bentolor opened this issue Nov 24, 2018 · 19 comments

@bentolor

bentolor commented Nov 24, 2018

Hello and thank you very much for your great work!

I got a problem I'm unable to understand: The snippet below used to work with asciidoctor-pdf but now breaks with the following error:

Failed to parse formatted text: My name<br> my street 123<br> town<br> &nbsp;<br> reference number 1234345
Failed to parse formatted text: My name<br> my street 123<br> town<br> &nbsp;<br> reference number 1234345

Furthermore it now renders in the PDF document as My name<br> my street 123<br> town<br> &nbsp;<br> reference number 1234345 while it still renders the line breaks correctly in asciidoctor.

I'm not sure if this is a duplicate of #485, which seems to be too old to be my issue.

Problematic Source

[frame="none",cols=">"]
|======================
| My name +
my street 123 +
town +
&nbsp; +
reference number 	1234345
|======================

Environment

$ asciidoctor --version
Asciidoctor 1.5.8 [https://asciidoctor.org]
Runtime Environment (ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu]) (lc:UTF-8 fs:UTF-8 in:UTF-8 ex:UTF-8)
$ asciidoctor-pdf --version
Asciidoctor PDF 1.5.0.alpha.16 using Asciidoctor 1.5.8 [https://asciidoctor.org]
Runtime Environment (ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu]) (lc:UTF-8 fs:UTF-8 in:UTF-8 ex:UTF-8)
@bentolor bentolor changed the title Regression: Line breaks Regression: Line breaks inside block causing 'Failed to parse formatted text' Nov 25, 2018
@mojavelinux
Member

Asciidoctor PDF doesn't seem to recognize "nbsp" as an entity name anymore. I'll look into it.

To work around this problem, you can do any of the following:

{sp} +

or

{nbsp} +

or

&#160; +

@mojavelinux
Member

Honestly, this goes waaaaaaaaaaaaaaaay back. In fact, I can't even find where in the code &nbsp; is supported.

The only named entities that are supported are as follows:

  • lt
  • gt
  • amp
  • quot
  • apos

This is consistent with XML rules. All other character references must be expressed in decimal (&#160;). (Support for hexadecimal is still pending in #486).
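
For documents that already contain named entities, one option is to convert them to decimal character references before running Asciidoctor PDF. The following is a minimal Ruby sketch of that idea, not part of Asciidoctor PDF. The lookup table `NAMED_TO_CODEPOINT` and the method `to_decimal_references` are hypothetical names introduced here for illustration, and the table covers only a few entries.

```ruby
# Hypothetical lookup table: extend it with whatever names your documents use.
NAMED_TO_CODEPOINT = {
  'nbsp'   => 160,
  'dagger' => 8224,
  'euro'   => 8364
}.freeze

# The five names listed above as supported; these are left untouched.
XML_ENTITIES = %w[lt gt amp quot apos].freeze

# Replace known named entities with decimal character references.
# Unknown names are left as-is so they stay visible for manual review.
def to_decimal_references(text)
  text.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    name = Regexp.last_match(1)
    next match if XML_ENTITIES.include?(name)

    codepoint = NAMED_TO_CODEPOINT[name]
    codepoint ? format('&#%d;', codepoint) : match
  end
end

puts to_decimal_references('town &nbsp; 5 &lt; 6 &euro;')
```

The substitution is purely textual, so it would also rewrite entities inside listing or passthrough blocks; treat it as a starting point rather than a safe batch converter.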

@mojavelinux mojavelinux added this to the support milestone Nov 25, 2018
@mojavelinux mojavelinux self-assigned this Nov 25, 2018
@bentolor
Author

Thanks for your quick & helpful feedback. I really did not expect the space to be the issue.

I have an old PDF of the source file, generated with asciidoctor-pdf in 2017. That's why I assumed a regression.

As all the line breaks were missing, I wrongly assumed an issue with them rather than with my tiny &nbsp;.

I now realize that I had simply assumed that named HTML entities are part of the AsciiDoc syntax, especially because asciidoctor accepted them happily. The AsciiDoc attributes like {nbsp} seem much more reasonable and the better choice.

@mojavelinux
Member

👍

Yeah, Asciidoctor PDF doesn't support all named entities because, frankly, it would overcomplicate the parser. So we've just stuck with decimal character references, which is what the built-in attributes in AsciiDoc produce anyway.

@bentolor bentolor changed the title Regression: Line breaks inside block causing 'Failed to parse formatted text' HTML entities like &nbsp; inside block causing 'Failed to parse formatted text' Nov 27, 2018
@DanielSWolf

I'm a bit confused. According to the Asciidoctor User Manual, Asciidoctor supports HTML entities. Here are examples from the manual:

  • &dagger; displays as †
  • &euro; displays as €
  • &loz; displays as ◊

I'm looking for a convenient way to enter symbols such as ≠ without pasting the Unicode character into the Asciidoctor document. I was very happy with &ne; until I found that Asciidoctor-PDF doesn't support them.

@bentolor
Author

Okay – I was not aware that Asciidoctor explicitly supports HTML entities and assumed that Asciidoc only accepted them on an opportunistic level.

I'm looking for a convenient way to enter symbols such as ≠ without pasting the Unicode character into the Asciidoctor document.

I think this is a reasonable motivation: sticking to a minimal ANSI character set improves the portability of Asciidoc documents and avoids unwanted encoding issues.

As {ne} or {dagger} are not working, I would say this is still a major issue deserving a fix and we probably should reopen this issue.

Asciidoctor PDF doesn't support all named entities because, frankly, it would overcomplicate the parser.

I hate when it comes to this point in software development…

@mojavelinux
Member

According to the Asciidoctor User Manual, Asciidoctor supports HTML entities.

There are three ways to define character references in XML: named entities, hexadecimal character references, and decimal character references. Asciidoctor PDF only supports decimal character references. But you can reference any character that's supported by the font.

  • &#8224; displays as † (dagger)
  • &#8364; displays as € (euro)
  • &#9674; displays as ◊ (lozenge)

I tend to use fileformat.info to figure out which decimal to use.
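
As an alternative to looking the number up, the decimal value can be computed from the character itself. This is a small Ruby sketch using only the standard `String#ord`; the method name `decimal_reference` is made up for this example.

```ruby
# Build the decimal character reference for the first character of a string.
def decimal_reference(char)
  format('&#%d;', char.ord)
end

%w[† € ◊].each do |char|
  puts "#{char} displays from #{decimal_reference(char)}"
end
```

The script has to be saved as UTF-8 (the default source encoding in current Ruby versions) for the literal characters to be read correctly.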

@mojavelinux
Member

Asciidoctor PDF doesn't support all named entities because, frankly, it would overcomplicate the parser.

I hate when it comes to this point in software development…

In this case, there's a very strong defense for it. The named entities are not actually standardized, and it's not up to Asciidoctor PDF to maintain that list. There are also many entities that have never been assigned an entity name (in the various systems that do define the names). It's much better to use the unicode character references because those are well-defined, comprehensive, and universal.

@mojavelinux
Member

I am willing to add support for the hexadecimal references for consistency, as proposed in #486.

@mojavelinux
Member

Although I don't love it, I would be willing to add nbsp given how prevalent it is. I just worry that it becomes a slippery slope.

@DanielSWolf

The named entities are not actually standardized

I just did some searching and found the official W3C specification for HTML 5.2. Section 8.5 is titled Named character references and contains the full list of entities.

The reason why I'd like to use named entities is readability and maintainability of the source file. When I see &#945; &#946; &#947; in an AsciiDoc file, I have no idea what these entities mean. I either have to look them up in a list or I need to use some graphical tool that will render the result. And when I want to insert another entity, I have no choice but to look it up.

On the other hand, if that same AsciiDoc file contains &alpha; &beta; &gamma;, I don't have to look up those characters; I know from reading the raw AsciiDoc file what they are. And when writing, it's very easy to memorize those entity names I frequently use.

@mojavelinux
Member

Section 8.5 is titled Named character references and contains the full list of entities.

True, but that's not Unicode. It's what I meant when I said that some other systems have standardized a subset of names. But it's nowhere near the number of glyphs in Unicode, so it's woefully inadequate. The names are also all over the place in length, casing, and language. So, to me, it's not a worthwhile effort.

When I see &#945; &#946; &#947; in an AsciiDoc file, I have no idea what these entities mean

Then you can use AsciiDoc attributes to give them meaningful names (which is the convention in AsciiDoc already).

@bentolor
Author

Asciidoctor PDF doesn't support all named entities because, frankly, it would overcomplicate the parser.

I hate when it comes to this point in software development…

In this case, there's a very strong defense for it.…

I'm sorry – obviously I expressed myself badly: I didn't mean to put you on the defensive. Quite the contrary: I meant that I hate it, as a software developer, when a user has a reasonable and plausible expectation of some simple functionality („why can't it just work™?“), but it turns out, from a developer's perspective, that although it "could" be done, it would cause so much pain and mess that nobody would really want it in the code base.

It's a draw between: "the solution in the code would be much uglier than your workaround"

Although I don't love it, I would be willing to add nbsp given how prevalent it is. I just worry that it becomes a slippery slope.

This could cause even more confusion, as it might encourage the expectation that HTML entities are supported. On the other hand, the already existing XML entities do the same.

The reason why I'd like to use named entities is readability and maintainability of the source file. When I see &#945; &#946; &#947; in an AsciiDoc file, I have no idea what these entities mean.

+1 from user perspective

@mojavelinux From a developer's perspective: if an implementation is not currently in scope, would it at least be possible to render a warning during parsing? I'm thinking of something like: Unknown XML entity "&ne;". Please use numerical notation (&#0123;) and/or Asciidoc attributes for special characters

Even without support for the entities themselves, this would show users who convert existing Asciidoc documents to PDF what needs to be ported to get them working.
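
A check of this kind can also be run as a separate step before conversion, outside the converter. The Ruby sketch below scans AsciiDoc source text for named entities other than the five supported ones and collects a warning for each. It is a standalone illustration under that assumption, not something Asciidoctor PDF provides; `SUPPORTED_NAMES` and `unsupported_entity_warnings` are hypothetical names.

```ruby
# The named entities listed earlier in the thread as supported.
SUPPORTED_NAMES = %w[lt gt amp quot apos].freeze

# Return one warning per unsupported named entity, with its line number.
def unsupported_entity_warnings(source)
  warnings = []
  source.each_line.with_index(1) do |line, lineno|
    line.scan(/&([A-Za-z][A-Za-z0-9]*);/) do |(name)|
      next if SUPPORTED_NAMES.include?(name)

      warnings << format(
        'line %d: unsupported named entity "&%s;"; use a decimal reference or an AsciiDoc attribute',
        lineno, name
      )
    end
  end
  warnings
end

sample = "My name +\n&nbsp; +\n1 &lt; 2 and a &ne; b\n"
puts unsupported_entity_warnings(sample)
```

The scan works on raw text, so it also reports entities inside listing blocks, where they may be intentional.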

@mojavelinux
Member

mojavelinux commented Jan 16, 2019

@bentolor I understand your original comment now. Ironically, my use of "strong defense" was also just an idiom. It was meant to imply the simpler functionality might be, in fact, the one we already have.

would it at least be possible to render a warning during parsing?

Well, that's a chicken-egg dilemma. If we could emit that message, then we'd have already parsed the named entity and would know how to replace it ;)

What's recognizable is what's familiar. Named entities have largely been discouraged in HTML because they are so inconsistently named, inconsistently supported, and only map a subset of known glyphs. If you use &#160; to represent a non-breaking space, you get used to seeing it, then you know what it is. But better is that you know exactly how to look it up. And if you want to assign it a familiar name, AsciiDoc already has mechanisms to do that.

@oyren

oyren commented Mar 6, 2019

+1 from user perspective
I would love to use μ (&mu;) for microsecond, along with some other Greek letters.

I come from Org-Mode and wrote my lecture notes with it. Since I like some things better in AsciiDoc, I would like to switch. But during a lecture I don't have time to look up numeric references.

@clbarnes

I know this is closed but it may be the most discoverable place for this:

My understanding from the discussion above is that the best way to handle unicode characters in a way which is both readable for a human and parseable for asciidoctor-pdf is to assign commonly-used unicode characters to attributes in the asciidoc header, then use those attributes. Given this, what do you think of my project here:

https://github.com/clbarnes/asciidoc-named-char-refs

which generates a file of attribute definitions for all of the W3C HTML5.2 named character references, which can be include::d in any asciidoc file one might care to write? All of the names have - appended to them so that they don't clash with your other attributes, and blocks of capitals are surrounded with _ in order to get around the case-insensitivity issue.

Suggestions welcome for any improvements, or feedback on why it's a terrible idea, just raise an issue!
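
To make the idea concrete, here is a minimal Ruby sketch of generating attribute entries from a name-to-character table. It is not the code from the linked project and does not follow its naming scheme (no appended `-`, no `_` around capitals); the table `CHARACTERS` and the method `attribute_entries` are hypothetical, and a real list would be far longer.

```ruby
# Hypothetical subset of names; a real list would be much longer.
CHARACTERS = { 'alpha' => 'α', 'beta' => 'β', 'gamma' => 'γ', 'ne' => '≠' }.freeze

# Emit one AsciiDoc attribute entry per name, e.g. ":alpha: &#945;".
def attribute_entries(characters)
  characters.map { |name, char| format(':%s: &#%d;', name, char.ord) }
end

puts attribute_entries(CHARACTERS)
```

Writing the output to a file and including it in the document header would let the source use `{alpha}` or `{ne}` while the converter only ever sees decimal references.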

@mojavelinux
Member

@clbarnes seems like a fine idea to me!

You can even streamline this further by using a preprocessor. A preprocessor can add additional attributes to the document header (just don't read any lines from the reader). It gets loaded like any other extension so you don't have to modify your document.

Here's an example:

Asciidoctor::Extensions.register do
  preprocessor do
    process do |doc, reader|
      doc.set_header_attribute 'mu', '&#956;'
      # etc..
      nil
    end
  end
end
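
The `# etc..` line can be filled in from a table rather than one call per symbol. The following is a sketch along those lines, assuming the same extension calls as in the example above; `SYMBOL_CODEPOINTS` and `header_attribute_values` are hypothetical names, and the registration is skipped when the asciidoctor gem is not installed.

```ruby
begin
  require 'asciidoctor'
  require 'asciidoctor/extensions'
rescue LoadError
  # The asciidoctor gem is not installed; the table below still loads.
end

# Hypothetical attribute name => Unicode code point table; adjust to taste.
SYMBOL_CODEPOINTS = { 'mu' => 956, 'ne' => 8800, 'dagger' => 8224 }.freeze

# Turn each code point into the decimal reference used as the attribute value.
def header_attribute_values(codepoints)
  codepoints.transform_values { |codepoint| format('&#%d;', codepoint) }
end

if defined?(Asciidoctor::Extensions)
  Asciidoctor::Extensions.register do
    preprocessor do
      process do |doc, _reader|
        header_attribute_values(SYMBOL_CODEPOINTS).each do |name, value|
          doc.set_header_attribute name, value
        end
        nil # do not read any lines from the reader
      end
    end
  end
end
```

Saved under a name such as `symbol-attributes.rb` (a made-up file name), it could be loaded with `asciidoctor-pdf -r ./symbol-attributes.rb document.adoc`, after which `{mu}` resolves in the document without any header changes.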

@mojavelinux
Member

I've decided I'm going to go ahead and add support for &nbsp; since it's so prevalent and therefore painful to have to work around.

@clbarnes

That does look more sensible! And not really any harder to implement, it's basically just a different format of config file after all. That approach does solve another issue I was wrestling with, too - I have a book with some chapters, but sometimes I want to PDF-ise chapters individually. I'd like to write all the necessary headers only once, but only apply them to whichever file is being targeted by asciidoctor-pdf. Currently have a file with rows of -a name=value@, and am cating that into the arguments, but a preprocessor sounds like a much better idea.
