HTML entities like inside block causing 'Failed to parse formatted text' #969
Comments
Asciidoctor PDF doesn't seem to recognize "nbsp" as an entity name anymore. I'll look into it. To work around this problem, you can do any of the following:
or
or
|
Honestly, this goes waaaaaaaaaaaaaaaay back. In fact, I can't even find where in the code where The only named entities that are supported are as follows:
This is consistent with XML rules. All other character references must be expressed in decimal ( |
Thanks for your quick & helpful feedback. I really did not expect the space to be the issue. I have an old asciidoc-pdf generated PDF from 2017 of the source file. That's why I assumed a regression. As all the line breaks were missing, I wrongly assumed an issue with them rather than with my tiny I now realized that I blatantly assumed that named HTML entities are just part of the AsciiDoc syntax. Especially because asciidoctor accepted them happily. The AsciiDoc entities like |
👍 Yeah, Asciidoctor PDF doesn't support all named entities because, frankly, it would overcomplicate the parser. So we've just stuck with decimal character references, which is what the built-in attributes in AsciiDoc produce anyway. |
I'm a bit confused. According to the Asciidoctor User Manual, Asciidoctor supports HTML entities. Here are examples from the manual:
I'm looking for a convenient way to enter symbols such as ≠ without pasting the Unicode character into the Asciidoctor document. I was very happy with |
Okay, I was not aware that Asciidoctor explicitly supports HTML entities and assumed that AsciiDoc only accepted them on an opportunistic level.
I think it is a reasonable motivation to stick to a minimal ANSI character set to achieve better portability of AsciiDoc documents and avoid unwanted encoding issues. As
I hate when it comes to this point in software development… |
There are three ways to define character references in XML: named entities, hexadecimal character references, and decimal character references. Asciidoctor PDF only supports decimal character references. But you can reference any character that's supported by the font.
I tend to use fileformat.info to figure out which decimal to use. |
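For readers who would rather not look up each code point by hand, the decimal form can also be derived from the character itself. A minimal Ruby sketch, assuming a hypothetical helper named `decimal_ref` (it is not part of Asciidoctor or Asciidoctor PDF):

```ruby
# Hypothetical helper: turn every character of a string into a decimal
# character reference such as "&#8800;".
def decimal_ref(text)
  text.each_char.map { |ch| "&##{ch.ord};" }.join
end

puts decimal_ref("\u2260") # NOT EQUAL TO; prints &#8800;
puts decimal_ref("\u03BC") # GREEK SMALL LETTER MU; prints &#956;
```

The printed reference can be pasted into the AsciiDoc source. Whether the glyph then shows up in the PDF still depends on the font, as noted above.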
In this case, there's a very strong defense for it. The named entities are not actually standardized, and it's not up to Asciidoctor PDF to maintain that list. There are also many characters that have never been assigned an entity name (in the various systems that do define the names). It's much better to use the Unicode character references because those are well-defined, comprehensive, and universal. |
I am willing to add support for the hexadecimal references for consistency, as proposed in #486. |
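Until hexadecimal references are supported, one possible stopgap is to rewrite them as decimal references before conversion. The sketch below is only an illustration of that idea: `hex_refs_to_decimal` is a hypothetical helper, not something Asciidoctor PDF provides, and it rewrites every match in the text it is given, including any inside listing or literal blocks.

```ruby
# Hypothetical helper: rewrite hexadecimal character references (&#x2260;)
# as decimal ones (&#8800;), leaving all other text untouched.
def hex_refs_to_decimal(text)
  text.gsub(/&#[xX]([0-9a-fA-F]+);/) { "&##{Regexp.last_match(1).hex};" }
end

puts hex_refs_to_decimal('a &#x2260; b') # prints a &#8800; b
```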
Although I don't love it, I would be willing to add |
I just did some searching and found the official W3C specification for HTML 5.2. Section 8.5 is titled Named character references and contains the full list of entities. The reason why I'd like to use named entities is readability and maintainability of the source file. When I see On the other hand, if that same AsciiDoc file contains |
True, but that's not Unicode. It's what I meant when I said that some other systems have standardized a subset of names. But it's nowhere near the number of glyphs in Unicode, so it's woefully inadequate. The names are also all over the place in length, casing, and language. So, to me, it's not a worthwhile effort.
Then you can use AsciiDoc attributes to give them meaningful names (which is the convention in AsciiDoc already). |
I'm sorry, I obviously expressed myself poorly: I didn't mean to put you on the defensive. Quite the contrary: I meant that I hate it, as a software developer, when a user has a reasonable and plausible expectation of simple functionality ("why can't it just work™?"), but it turns out, from a developer perspective, that it maybe "could" be done but would cause so much pain and mess that nobody would really like to have it in the code base. It's a draw between: "the solution in the code would be much uglier than your workaround"
This could cause even more confusion, as it might encourage the expectation that HTML entities are supported. On the other hand, the already existing XML entities do the same.
+1 from user perspective @mojavelinux From developer perspective: If an implementation is not in the current scope, would it at least be possible to render a warning during parsing? I'm thinking about something like: Though not supported, this would guide users who try to convert existing AsciiDoc documents into PDFs as to what needs to be ported to get it working. |
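A warning of this kind could also be approximated outside the converter, as a pre-flight check on the source file. The following Ruby sketch is an assumption-laden illustration, not Asciidoctor PDF behavior: `unsupported_named_entities` is a hypothetical helper, and the allowed set is assumed to be the five entities that XML predefines, since the actual list of supported names is not reproduced in this thread.

```ruby
# Assumption: only the five XML-predefined names are treated as safe here.
XML_PREDEFINED = %w[amp lt gt quot apos].freeze

# Hypothetical helper: return [line_number, entity] pairs for every named
# reference in the source that is not in the allowed set.
def unsupported_named_entities(source, allowed = XML_PREDEFINED)
  findings = []
  source.each_line.with_index(1) do |line, lineno|
    line.scan(/&([A-Za-z][A-Za-z0-9]*);/) do |(name)|
      findings << [lineno, "&#{name};"] unless allowed.include?(name)
    end
  end
  findings
end

sample = "My name&nbsp;here\nA &amp; B &ne; C\n"
unsupported_named_entities(sample).each do |lineno, entity|
  warn "line #{lineno}: named entity #{entity} may need a decimal reference"
end
```

Because it works on raw text, such a check will also flag entity-like strings inside code listings, so its output is a hint rather than a verdict.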
@bentolor I understand your original comment now. Ironically, my use of "strong defense" was also just an idiom. It was meant to imply the simpler functionality might be, in fact, the one we already have.
Well, that's a chicken-egg dilemma. If we could emit that message, then we'd have already parsed the named entity and would know how to replace it ;) What's recognizable is what's familiar. Named entities have largely been discouraged in HTML because they are so inconsistently named, inconsistently supported, and only map a subset of known glyphs. If you use |
+1 from user perspective I come from Org-Mode and wrote notes with it during lectures. Since I like some things better in AsciiDoc, I would like to switch. But during a lecture I don't have time to look up numeric references. |
I know this is closed but it may be the most discoverable place for this: My understanding from the discussion above is that the best way to handle unicode characters in a way which is both readable for a human and parseable for asciidoctor-pdf is to assign commonly-used unicode characters to attributes in the asciidoc header, then use those attributes. Given this, what do you think of my project here: https://github.com/clbarnes/asciidoc-named-char-refs which generates a file of attribute definitions for all of the W3C HTML5.2 named character references, which can be Suggestions welcome for any improvements, or feedback on why it's a terrible idea, just raise an issue! |
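To make the attribute approach concrete, here is a minimal Ruby sketch of generating AsciiDoc attribute entries that map readable names to decimal character references. It is not the code of the linked project; the `NAMED_CHARS` table and the `attribute_entries` helper are hypothetical, and a real table would be far longer.

```ruby
# Hypothetical name => code point table (illustrative subset only).
NAMED_CHARS = {
  'ne' => 0x2260, # NOT EQUAL TO
  'le' => 0x2264, # LESS-THAN OR EQUAL TO
  'mu' => 0x03BC  # GREEK SMALL LETTER MU
}.freeze

# Hypothetical helper: build one attribute entry per name, e.g. ":ne: &#8800;".
def attribute_entries(table)
  table.map { |name, codepoint| ":#{name}: &##{codepoint};" }.join("\n")
end

puts attribute_entries(NAMED_CHARS)
```

With the generated entries in the document header, the source can use `{ne}` or `{mu}` while the converter only ever sees decimal references.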
@clbarnes seems like a fine idea to me! You can even streamline this further by using a preprocessor. A preprocessor can add additional attributes to the document header (just don't read any lines from the reader). It gets loaded like any other extension so you don't have to modify your document. Here's an example:

    require 'asciidoctor/extensions'

    Asciidoctor::Extensions.register do
      preprocessor do
        process do |doc, reader|
          # Define {mu} as a header attribute without consuming any source lines.
          doc.set_header_attribute 'mu', 'μ'
          # etc..
          # Return nil so Asciidoctor keeps using the original reader.
          nil
        end
      end
    end
|
I've decided I'm going to go ahead and add support for |
That does look more sensible! And not really any harder to implement, it's basically just a different format of config file after all. That approach does solve another issue I was wrestling with, too - I have a book with some chapters, but sometimes I want to PDF-ise chapters individually. I'd like to write all the necessary headers only once, but only apply them to whichever file is being targeted by asciidoctor-pdf. Currently have a file with rows of |
Hello and thank you very much for your great work!
I got a problem I'm unable to understand: The snippet below used to work with asciidoctor-pdf but now breaks with the following error:
Furthermore it now renders in the PDF document as
My name<br> my street 123<br> town<br> <br> reference number 1234345
while it still renders the line breaks correctly in asciidoctor. I'm not sure if this is a duplicate of #485, which seems to be too old to be my issue.
Problematic Source
Environment