Math parsing problem of \boldsymbol{0} ? #926

dginev commented Feb 1, 2018 • edited

I don't think that's just it. I have to rebuild my Perl installation right now (because of fun reasons with version compatibility) and will then prep a minimal example for my theory. My current understanding is that binmode(STDOUT, ":encoding(UTF-8)") really means "explicitly encode to UTF-8 before printing to the stream", which is what you want if you are working with Unicode strings that have been decoded into Perl's internal form: those cannot be printed directly without an encoding pass, or they will come out garbled. It may be that the exact set of steps that produces the bold math Unicode symbols already runs an explicit encode call when depositing the string to be absorbed into the Document, which would then lead to a double-encode error on print to STDOUT.
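
To make the double-encode theory concrete, here is a minimal sketch in Python rather than Perl, so it only stands in for what the Perl I/O layer would do. It assumes the bold zero is U+1D7CE (MATHEMATICAL BOLD DIGIT ZERO), and `double_encode` is a hypothetical helper, not anything in LaTeXML. The second pass treats each already-encoded byte as a character and encodes it again.

```python
def double_encode(text: str) -> bytes:
    """Encode text to UTF-8, then mistakenly encode the resulting bytes again."""
    once = text.encode("utf-8")
    # The bug being illustrated: each byte is read back as one character
    # (Latin-1 maps bytes 0-255 to code points 0-255) and encoded a second time.
    return once.decode("latin-1").encode("utf-8")


bold_zero = "\U0001D7CE"  # assumed output character for \boldsymbol{0}
pi = "\u03c0"             # GREEK SMALL LETTER PI

for ch in (bold_zero, pi):
    once = ch.encode("utf-8")
    twice = double_encode(ch)
    # Every non-ASCII byte grows into two bytes on the second pass.
    print(f"U+{ord(ch):04X}: once={once.hex()} twice={twice.hex()}")
```

Pure ASCII survives the second pass unchanged, which would be consistent with the problem only showing up on non-ASCII output.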
Collaborator

dginev commented Feb 1, 2018 • edited

If we were to only "mark" the output as Unicode, it seems the correct way would be to specify :utf8 instead of :encoding(UTF-8), as the docs explain. Not getting any simpler, but maybe getting closer to understandable... EDIT: Just tried the :utf8 variant and it has the same effect as the :encoding one. Removing the binmode gets the bold 0 symbol to print correctly.
Collaborator

dginev commented Feb 1, 2018

Also, this is not specific to latexmlmath; I can reproduce the garbled output in latexmlc via:

latexmlc --format=html5 --whatsin=fragment "literal:$\displaystyle\boldsymbol{0}$" --preload=amsmath
Owner

brucemiller commented Feb 1, 2018

 Don't get sidetracked with plane1 or even fancy math; you'll get the same problem with \pi
Collaborator

dginev commented Feb 1, 2018 • edited

Actually, now that I try plain text, all Unicode is coming out double-encoded on STDOUT. Huh. File output still works reliably. And I verified this is seen both when reading from STDIN and when reading from a file.
Collaborator

dginev commented Feb 1, 2018 • edited

Ok, so I now retried with the exact example I fixed in #918, since I am starting to go crazy. I can confirm my PR fixed the exact error reported in that issue, one of:

processing started Fri Feb 2 00:28:23 2018:12: parser error : Input is not proper UTF-8, indicate encoding !

I can also confirm that the argument-free calls latexml test.tex and latexmlc test.tex print correct, valid Unicode on STDOUT. However, I think post-processing is somehow not as robust as the core processing when it comes to encoding for print to STDOUT. Modifying the command to latexmlc --format=html5 test.tex leads to the Unicode getting double-encoded. So this gives me something to diagnose with. I am pretty certain it is not executable-specific, as it can be seen in both latexmlc and latexmlmath, and it now seems clear that it relates to the HTML5 post-processing toolchain.
Collaborator

dginev commented Feb 1, 2018

Wow, so all of the cases that looked correct on STDOUT were in fact printed not by libxml's serializer but by your serialize_aux in Document.pm! Ok... Then maybe there is a solution where we never use binmode at all, but instead add an explicit encode to UTF-8 to the result of the serialize_aux call? Quite the ball of yarn to untangle.
Owner

brucemiller commented Feb 2, 2018

Actually, not so hard, I think. For the record, here is an old but complete, if painful, summary of issues with Unicode (not even necessarily restricted to Perl): https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default. The relevant takeaway is that all communication with the outside world is via bytes, not characters.

Most likely the confusing thing here is that there are 3 distinct "Document" classes in play (LaTeXML::Core, LaTeXML::Post & XML::LibXML), each of which has a toString method. Only the last one's toString "helpfully" does the encoding for you; its serialization is bytes, not characters.

Note that we pretty consistently only use STDOUT (and files we've opened ourselves) for outputting "real" data, and that is done from relatively few places. So I think the key is that whenever we output real data, we want to be conscious of, and explicit about, whether it is in character form or already encoded bytes; if the former, we encode. In fact, I'd rather not even risk the "sticky" binmode encoding layer, and just explicitly encode the serialization at the point where it's output, if & when needed.

[Aside: STDERR is for less critical, informative messages. And since in the program generally we want to deal with characters, not bytes, it's reasonable and convenient to use binmode to set STDERR to utf8.]

So, that's what I've implemented and it seems to work. Please verify. FWIW: your 1st commit for #918 was fine, but the 2nd commit went too far.
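
The "encode explicitly at the point of output" idea can be sketched as follows. This is a Python illustration under stated assumptions, not the Perl code that was committed: `write_output` is a hypothetical name, and Python's `str`/`bytes` types stand in for the character-versus-bytes distinction. Perl scalars do not carry that distinction as a separate type, so there the caller has to know which form it holds.

```python
import io
from typing import BinaryIO, Union


def write_output(data: Union[str, bytes], stream: BinaryIO) -> None:
    """Write real data to a byte stream, encoding only if it is still characters."""
    if isinstance(data, str):
        # Character form: encode exactly once, here at the output boundary.
        data = data.encode("utf-8")
    # Already-encoded bytes pass through untouched, so nothing is encoded twice.
    stream.write(data)


# Stand-in for a binary output handle with no encoding layer attached.
out = io.BytesIO()
write_output("\u03c0", out)                  # characters (hypothetical Core-style output)
write_output("\u03c0".encode("utf-8"), out)  # bytes (hypothetical libxml-style output)
print(out.getvalue().hex())
```

Both calls put the same two bytes on the stream. To write to the real standard output in the same style, one would pass `sys.stdout.buffer` as the stream.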
Owner

brucemiller commented Feb 2, 2018

Oh, @dginev: I wanted to point out that there's one case of binmode on STDOUT in latexmlc that I wasn't sure about. I didn't want to track down all the possibilities of what $result might be :> The binmode is conditional on $is_archive, but maybe a more explicit $needs_encode might be appropriate?
Collaborator

dginev commented Feb 3, 2018

 Cool, this works with the examples, and indeed latexmlc needs a fix for stdout. Making a PR.

Owner

brucemiller commented Feb 3, 2018

OK, hopefully the encoding is dealt with correctly now. Thanks, all.

