HTML entities like &nbsp; inside block causing 'Failed to parse formatted text' #969

bentolor · 2018-11-24T14:19:34Z

Hello and thank you very much for your great work!

I got a problem I'm unable to understand: The snippet below used to work with asciidoctor-pdf but now breaks with the following error:

Failed to parse formatted text: My name<br> my street 123<br> town<br> &nbsp;<br> reference number 1234345
Failed to parse formatted text: My name<br> my street 123<br> town<br> &nbsp;<br> reference number 1234345

Furthermore, it now renders in the PDF document as My name<br> my street 123<br> town<br> &nbsp;<br> reference number 1234345 while it still renders the line breaks correctly in asciidoctor.

I'm not sure if this is a duplicate of #485, which seems to be too old to be my issue.

Problematic Source

[frame="none",cols=">"]
|======================
| My name +
my street 123 +
town +
&nbsp; +
reference number 	1234345
|======================

Environment

$ asciidoctor --version
Asciidoctor 1.5.8 [https://asciidoctor.org]
Runtime Environment (ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu]) (lc:UTF-8 fs:UTF-8 in:UTF-8 ex:UTF-8)
$ asciidoctor-pdf --version
Asciidoctor PDF 1.5.0.alpha.16 using Asciidoctor 1.5.8 [https://asciidoctor.org]
Runtime Environment (ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu]) (lc:UTF-8 fs:UTF-8 in:UTF-8 ex:UTF-8)

mojavelinux · 2018-11-25T16:54:53Z

Asciidoctor PDF doesn't seem to recognize "nbsp" as an entity name anymore. I'll look into it.

To work around this problem, you can use any of the following:

{sp} +

or

{nbsp} +

or

&#160; +

mojavelinux · 2018-11-25T17:01:17Z

Honestly, this goes waaaaaaaaaaaaaaaay back. In fact, I can't even find where in the code &nbsp; is supported.

The only named entities that are supported are as follows:

lt
gt
amp
quot
apos

This is consistent with XML rules. All other character references must be expressed in decimal (&#160;). (Support for hexadecimal is still pending in #486).

bentolor · 2018-11-26T08:48:41Z

Thanks for your quick & helpful feedback. I really did not expect the space to be the issue.

I have an old asciidoc-pdf generated PDF from 2017 of the source file. That's why I assumed a regression.

As all the line breaks were missing, I wrongly assumed an issue with them rather than with my tiny &nbsp;.

I now realized that I blatantly assumed that named HTML entities are just part of the Asciidoc syntax. Especially because asciidoctor accepted them happily. The Asciidoc entities like {nbsp} seem to be much more reasonable and the better choice.

mojavelinux · 2018-11-27T03:34:01Z

👍

Yeah, Asciidoctor PDF doesn't support all named entities because, frankly, it would over complicate the parser. So we've just stuck with decimal character references, which is what the built-in attributes in AsciiDoc produce anyway.

DanielSWolf · 2019-01-15T12:39:10Z

I'm a bit confused. According to the Asciidoctor User Manual, Asciidoctor supports HTML entities. Here are examples from the manual:

&dagger; displays as †

&euro; displays as €

&loz; displays as ◊

I'm looking for a convenient way to enter symbols such as ≠ without pasting the Unicode character into the Asciidoctor document. I was very happy with &ne; until I found that Asciidoctor-PDF doesn't support them.

bentolor · 2019-01-15T12:55:08Z

Okay – I was not aware that Asciidoctor explicitly supports HTML entities and assumed that Asciidoc only accepted them on an opportunistic level.

I'm looking for a convenient way to enter symbols such as ≠ without pasting the Unicode character into the Asciidoctor document.

I think this is a reasonable motivation: sticking to a minimal ANSI character set makes Asciidoc documents more portable and avoids unwanted encoding issues.

As {ne} or {dagger} are not working, I would say this is still a major issue deserving a fix and we probably should reopen this issue.

Asciidoctor PDF doesn't support all named entities because, frankly, it would over complicate the parser.

I hate when it comes to this point in software development…

mojavelinux · 2019-01-16T02:39:10Z

According to the Asciidoctor User Manual, Asciidoctor supports HTML entities.

There are three ways to define character references in XML: named entities, hexadecimal character references, and decimal character references. Asciidoctor PDF only supports decimal character references. But you can reference any character that's supported by the font.

&#8224; displays as † (dagger)
&#8364; displays as € (euro)
&#9674; displays as ◊ (lozenge)

I tend to use fileformat.info to figure out which decimal to use.
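
When a lookup site is not at hand, the decimal value can also be derived from the character itself. The following is only a minimal Ruby sketch; the helper name `decimal_ref` is made up for illustration and is not part of Asciidoctor or Asciidoctor PDF.

```ruby
# Hypothetical helper (not an Asciidoctor API): build a decimal character
# reference such as "&#8224;" from a single character.
def decimal_ref(char)
  # String#ord returns the Unicode code point of the first character.
  format('&#%d;', char.ord)
end

# Print the decimal reference for a few symbols mentioned in this thread.
%w[† € ◊ ≠ μ].each { |char| puts "#{char} -> #{decimal_ref(char)}" }
```

The printed references can be pasted into the AsciiDoc source as-is, provided the font used for the PDF contains the glyph.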

mojavelinux · 2019-01-16T02:42:17Z

Asciidoctor PDF doesn't support all named entities because, frankly, it would over complicate the parser.

I hate when it comes to this point in software development…

In this case, there's a very strong defense for it. The named entities are not actually standardized, and it's not up to Asciidoctor PDF to maintain that list. There are also many entities that have never been assigned an entity name (in the various systems that do define the names). It's much better to use the unicode character references because those are well-defined, comprehensive, and universal.

mojavelinux · 2019-01-16T02:43:16Z

I am willing to add support for the hexadecimal references for consistency, as proposed in #486.

mojavelinux · 2019-01-16T02:44:13Z

Although I don't love it, I would be willing to add nbsp given how prevalent it is. I just worry that it becomes a slippery slope.

DanielSWolf · 2019-01-16T09:17:57Z

The named entities are not actually standardized

I just did some searching and found the official W3C specification for HTML 5.2. Section 8.5 is titled Named character references and contains the full list of entities.

The reason why I'd like to use named entities is readability and maintainability of the source file. When I see &#945; &#946; &#947; in an AsciiDoc file, I have no idea what these entities mean. I either have to look them up in a list or I need to use some graphical tool that will render the result. And when I want to insert another entity, I have no choice but to look it up.

On the other hand, if that same AsciiDoc file contains &alpha; &beta; &gamma;, I don't have to look up those characters; I know from reading the raw AsciiDoc file what they are. And when writing, it's very easy to memorize those entity names I frequently use.

mojavelinux · 2019-01-16T09:34:25Z

Section 8.5 is titled Named character references and contains the full list of entities.

True, but that's not Unicode. It's what I meant when I said that some other systems have standardized a subset of names. But it's nowhere near the number of glyphs in Unicode, so it's woefully inadequate. The names are also all over the place in length, casing, and language. So, to me, it's not a worthwhile effort.

When I see &#945; &#946; &#947; in an AsciiDoc file, I have no idea what these entities mean

Then you can use AsciiDoc attributes to give them meaningful names (which is the convention in AsciiDoc already).
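
As a rough illustration of that convention, the sketch below generates AsciiDoc attribute entries that map readable names to decimal character references. The `SYMBOLS` map and the `attribute_entries` helper are hypothetical names chosen for this example; only the `:name: value` attribute entry syntax and the `{name}` reference syntax come from AsciiDoc itself.

```ruby
# Hypothetical name-to-character map; extend it with the symbols you use.
SYMBOLS = { 'alpha' => 'α', 'beta' => 'β', 'gamma' => 'γ', 'ne' => '≠' }.freeze

# Render one AsciiDoc attribute entry per symbol, e.g. ":alpha: &#945;".
def attribute_entries(symbols)
  symbols.map { |name, char| format(':%s: &#%d;', name, char.ord) }
end

# Print the entries so they can be pasted into a document header
# or saved to a file that the document includes.
puts attribute_entries(SYMBOLS)
```

With those entries in the header, the source can say {alpha} {beta} {gamma} instead of the bare numeric references.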

bentolor · 2019-01-16T15:08:04Z

Asciidoctor PDF doesn't support all named entities because, frankly, it would over complicate the parser.

I hate when it comes to this point in software development…

In this case, there's a very strong defense for it.…

I'm sorry – obviously I expressed myself wrong: I didn't want to put you on the defensive. Rather the contrary: I meant that, as a software developer, I hate it when it comes to the point that a user has a reasonable and plausible expectation for a simple functionality („why can't it just work™?“), but it turns out, from a developer perspective, that maybe it "could" be done but would cause so much pain and mess that nobody would really like to have that in the code base.

It's a draw between: "the solution in the code would be much uglier than your workaround"

Although I don't love it, I would be willing to add nbsp given how prevalent it is. I just worry that it becomes a slippery slope.

This could cause even more confusion, as it might encourage the expectation that HTML entities are supported. On the other hand, the already existing XML entities do the same.

The reason why I'd like to use named entities is readability and maintainability of the source file. When I see &#945; &#946; &#947; in an AsciiDoc file, I have no idea what these entities mean.

+1 from user perspective

@mojavelinux From a developer perspective: if an implementation is not in the current scope, would it at least be possible to render a warning during parsing? I'm thinking of something like: Unknown XML entity "&ne;". Please use numerical notation (&#123;) and/or Asciidoc attributes for special characters

Even without support, this would show users who convert existing Asciidoc documents to PDF what needs to be ported to get them working.

mojavelinux · 2019-01-16T20:05:14Z

@bentolor I understand your original comment now. Ironically, my use of "strong defense" was also just an idiom. It was meant to imply the simpler functionality might be, in fact, the one we already have.

would it at least possible to render a warning during parsing?

Well, that's a chicken-egg dilemma. If we could emit that message, then we'd have already parsed the named entity and would know how to replace it ;)

What's recognizable is what's familiar. Named entities have largely been discouraged in HTML because they are so inconsistently named, inconsistently supported, and only map a subset of known glyphs. If you use &#160; to represent a non-breaking space, you get used to seeing it, then you know what it is. But better is that you know exactly how to look it up. And if you want to assign it a familiar name, AsciiDoc already has mechanisms to do that.
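
Outside the converter, a user-side check of the raw source is still possible. The sketch below is an assumption-laden illustration, not something Asciidoctor PDF does: the constant `SUPPORTED` and the helper `unsupported_entities` are hypothetical names, the supported list is simply the five names given earlier in this thread, and the scan does not skip literal or listing blocks.

```ruby
# The five named entities listed as supported earlier in this thread.
SUPPORTED = %w[lt gt amp quot apos].freeze

# Return the distinct named entities in text that are not in SUPPORTED.
# Numeric references such as "&#160;" are ignored because the name
# must start with a letter.
def unsupported_entities(text)
  text.scan(/&([A-Za-z][A-Za-z0-9]*);/).flatten.uniq - SUPPORTED
end

sample = 'town &nbsp; &amp; &#160; reference &ne; number &nbsp;'
unsupported_entities(sample).each do |name|
  warn "Named entity &#{name}; may not be converted; consider a decimal reference or an attribute"
end
```

Running something like this over a document before conversion would point to the entities that need porting, which is the guidance asked for above, without requiring the converter to know every entity name.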

oyren · 2019-03-06T19:31:56Z

+1 from user perspective
I would love to use μ (& mu;) for Microsecond and some other Greek letters.

I come from Org-Mode and wrote notes with it during lectures. Since I like some things better in Asciidoc, I would like to switch. But during a lecture I don't have time to look up numeric references.

clbarnes · 2019-04-19T18:10:16Z

I know this is closed but it may be the most discoverable place for this:

My understanding from the discussion above is that the best way to handle unicode characters in a way which is both readable for a human and parseable for asciidoctor-pdf is to assign commonly-used unicode characters to attributes in the asciidoc header, then use those attributes. Given this, what do you think of my project here:

https://github.com/clbarnes/asciidoc-named-char-refs

which generates a file of attribute definitions for all of the W3C HTML5.2 named character references, which can be include::d in any asciidoc file one might care to write? All of the names have - appended to them so that they don't clash with your other attributes, and blocks of capitals are surrounded with _ in order to get around the case-insensitivity issue.

Suggestions welcome for any improvements, or feedback on why it's a terrible idea, just raise an issue!

mojavelinux · 2019-04-20T09:12:53Z

@clbarnes seems like a fine idea to me!

You can even streamline this further by using a preprocessor. A preprocessor can add additional attributes to the document header (just don't read any lines from the reader). It gets loaded like any other extension so you don't have to modify your document.

Here's an example:

Asciidoctor::Extensions.register do
  preprocessor do
    process do |doc, reader|
      doc.set_header_attribute 'mu', '&#956;'
      # etc..
      nil
    end
  end
end

mojavelinux · 2019-04-20T09:14:02Z

I've decided I'm going to go ahead and add support for &nbsp; since it's so prevalent and therefore painful to have to work around.

clbarnes · 2019-04-20T15:10:13Z

That does look more sensible! And not really any harder to implement, it's basically just a different format of config file after all. That approach does solve another issue I was wrestling with, too - I have a book with some chapters, but sometimes I want to PDF-ise chapters individually. I'd like to write all the necessary headers only once, but only apply them to whichever file is being targeted by asciidoctor-pdf. Currently have a file with rows of -a name=value@, and am cating that into the arguments, but a preprocessor sounds like a much better idea.

bentolor changed the title ~~Regression: Line breaks~~ Regression: Line breaks inside block causing 'Failed to parse formatted text' Nov 25, 2018

mojavelinux added this to the support milestone Nov 25, 2018

mojavelinux self-assigned this Nov 25, 2018

mojavelinux closed this as completed Nov 27, 2018

bentolor changed the title ~~Regression: Line breaks inside block causing 'Failed to parse formatted text'~~ HTML entities like &nbsp; inside block causing 'Failed to parse formatted text' Nov 27, 2018

ggrossetie mentioned this issue Oct 12, 2019

Rendering Greek letters when no math is needed asciidoctor/asciidoctor#3451

Open

sonrad10 mentioned this issue Apr 21, 2021

Errors with asciidoctor-pdf riboseinc/asciidoctor-bibliography#111

Open

kennypete mentioned this issue May 31, 2022

An alternative way to achieve the same result clbarnes/asciidoc-named-char-refs#2

Open
