Math parsing problem of `\boldsymbol{0}` ? #926

Closed
asmaier opened this Issue Jan 12, 2018 · 21 comments

asmaier commented Jan 12, 2018

When converting the following formula to MathML (Mac OS X, v0.8.2 of LaTeXML), it seems the number 0 somehow gets lost:

$ latexmlmath "\displaystyle\boldsymbol{0}" --preload=amsmath
<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\displaystyle\boldsymbol{0}" display="block">
  <mn/>
</math>

One can see in the MathML output that the number 0 is missing. It doesn't happen with other numbers like \boldsymbol{1}:

$ latexmlmath "\displaystyle\boldsymbol{1}" --preload=amsmath
<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\displaystyle\boldsymbol{1}" display="block">
  <mn>𝟏</mn>
</math>

It also doesn't happen if I change the formula and write the 0 without \boldsymbol:

$ latexmlmath "\displaystyle 0" --preload=amsmath
<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\displaystyle 0" display="block">
  <mn>0</mn>
</math>

What is going on?

@dginev dginev added this to the LaTeXML-0.8.3 milestone Jan 12, 2018

Collaborator

dginev commented Jan 12, 2018

This should be an easy fix for a classic Perl error: an if conditional that is meant to test whether something is defined tests its truth value instead, and the literal string "0" is false in Perl. These sometimes slip in; it's a great catch.
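For readers unfamiliar with this pitfall, here is a minimal sketch of how it can swallow a "0". The helper names `render_buggy` and `render_fixed` are hypothetical and are not taken from LaTeXML's source; the real conditional may look quite different.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not LaTeXML code.

# Buggy: a truthiness test treats the string "0" as false,
# so the digit is dropped and an empty element is produced.
sub render_buggy {
    my ($text) = @_;
    return $text ? "<mn>$text</mn>" : "<mn/>";
}

# Safer: test definedness and length instead of truthiness.
sub render_fixed {
    my ($text) = @_;
    return (defined($text) && length($text)) ? "<mn>$text</mn>" : "<mn/>";
}

print render_buggy("0"), "\n";    # <mn/>
print render_buggy("1"), "\n";    # <mn>1</mn>
print render_fixed("0"), "\n";    # <mn>0</mn>
```

The empty `<mn/>` from the buggy variant matches the symptom in the original report, where only the 0 disappears.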

Owner

brucemiller commented Jan 31, 2018

Yeah; I have no doubt that this happened, and probably exactly how @dginev says. But it seems that it's already been fixed. If that turns out not to be true, please reopen. Thanks for the report.

asmaier commented Jan 31, 2018

When I try the same example with the LaTeXML version from HEAD, I do see the following:

$ latexmlmath "\displaystyle\boldsymbol{0}" --preload=amsmath
<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\displaystyle\boldsymbol{0}" display="block">
  <mn>�</mn>
</math>

and also

$ latexmlmath "\displaystyle\boldsymbol{1}" --preload=amsmath
<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\displaystyle\boldsymbol{1}" display="block">
  <mn>�</mn>
</math>

So I don't think this issue is fixed correctly.

@brucemiller brucemiller reopened this Feb 1, 2018

Owner

brucemiller commented Feb 1, 2018

Well, if you output to a file, it seems to be the correct unicode; it's just mangled on stdout. And this seems due to @dginev's addition of binmode in latexmlmath, which seems to me at best to be correct and at worst to be harmless. But apparently it isn't? I'm a bit baffled. Any intuitions on this @dginev?

Collaborator

dginev commented Feb 1, 2018

Yeah, it is obviously double encoded, and upon inspecting the source of latexmlmath, the culprit is also rather proudly announcing itself, namely:

my $newdoc = XML::LibXML::Document->new("1.0", "UTF-8");

Here is a standalone test snippet that demonstrates the double encoding:

use XML::LibXML;

my $newdoc = XML::LibXML::Document->new("1.0", "UTF-8");
my $newel = XML::LibXML::Element->new("проба");
$newdoc->setDocumentElement($newel);
my $serialized = $newdoc->toString(1);

print "1. \n", $serialized,"\n\n";

binmode(STDOUT, ":encoding(UTF-8)");

print "2. \n", $serialized,"\n\n";

which outputs:

1. 
<?xml version="1.0" encoding="UTF-8"?>
<проба/>


2. 
<?xml version="1.0" encoding="UTF-8"?>
<п�оба/>

First being correct, second double-encoded. So I guess latexmlmath was correct before my supposed "fix", as the serialization of libxml took into account the encoding of the document? Confusing...

Collaborator

dginev commented Feb 1, 2018

LaTeXML::Document initializes the document in the exact same way:

my $doc = XML::LibXML::Document->new("1.0", "UTF-8");

So I am starting to go back to confused about why the serialization doesn't encode correctly...

Collaborator

dginev commented Feb 1, 2018

Ah, I think I got it.

The difference is that the Unicode for the math case was generated by LaTeXML - and I believe was already encoded internally as UTF-8, which is why it is getting the double encoding.

My fixes for STDOUT printing were specifically for pass-through content that came from input as Unicode, and got decoded by latexml for internal use.

In other words, we are inconsistently handling the encodings of content generated internally by latexml and content entering from input TeX.

Collaborator

dginev commented Feb 1, 2018

Tracing further, I think it boils down to having %plane1map and similar lookup tables in MathML.pm that are already encoded as Unicode, and end up double-encoded on printing to STDOUT.

Not brave enough to suggest which part to refactor in which direction yet, but I think this may be the root issue here.

Owner

brucemiller commented Feb 1, 2018

Well, now I'm more confused than ever! :> Actually, our XML (& its serializations) should always be utf8 (or so I thought); not just with the plane1 conversions, or even for math, but always unless it's in the ASCII subset. So, I'm really confused about what binmode is actually for; I'd thought it just prepared the stream for receiving utf8; if we give it utf8 then no encoding necessary. OTOH, if I take your interpretation, we probably should never use binmode? I don't understand the 2nd comment above about "decoded by latexml for internal use"; if latexml reads in xml, it should also be utf8, right?

Owner

brucemiller commented Feb 1, 2018

Ah, I think the issue is that XML::LibXML::Document->toString converts to bytes, not chars, according to the document's encoding, so it's already been encoded (to utf8 in our case). (ugh) This is contrary to other XML toString methods, and the reasoning is apparently so that one can simply use print $string.

That hardly seems desirable, but in any case, we need to be clearer about when we have bytes vs chars.
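The bytes-versus-characters distinction can be reproduced with only the core Encode module. This is a sketch under the assumption that a second encoding pass (such as a `:encoding(UTF-8)` layer on the output stream) is applied to a string that is already UTF-8 bytes; it does not call XML::LibXML, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: GREEK SMALL LETTER PI (U+03C0).
my $chars = "\x{3C0}";

# Encoding once yields the two UTF-8 bytes 0xCF 0x80.
my $bytes = encode('UTF-8', $chars);

# Encoding again treats each byte as a Latin-1 character and
# encodes it separately: this is the double encoding.
my $twice = encode('UTF-8', $bytes);

printf "%d %d %d\n", length($chars), length($bytes), length($twice);  # 1 2 4
```

The four-byte result is what a terminal shows as mangled output, while the two-byte string is what a file written without an extra encoding layer contains.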

Collaborator

dginev commented Feb 1, 2018

I don't think that's just it, have to rebuild my Perl installation right now (because of fun reasons with version compatibility) and will then prep a minimal example for my theory.

My current understanding is that binmode(STDOUT, ":encoding(UTF-8)") really means "explicitly encode to UTF-8 before printing to the stream", which is what you want if you are working with character strings that have been decoded into Perl's internal form. Those cannot be printed directly without an encoding pass, or they will come out garbled.

It may be the case that the exact set of steps that produce the bold math unicode symbols already run an explicit encode call when depositing the string to be absorbed in the Document, which would then lead to a double-encode error on print to STDOUT.

Collaborator

dginev commented Feb 1, 2018

If we were to only "mark" the output as unicode, it seems the correct way would be to specify :utf8 instead of :encoding(UTF-8), as the docs explain.

Not getting any simpler, but maybe getting closer to understandable...

EDIT: Just tried the :utf8 variant and it has the same effect as the :encoding one. Removing the binmode gets the bold 0 symbol to print correctly.

Collaborator

dginev commented Feb 1, 2018

Also, not specific to latexmlmath, can reproduce the garbled output in latexmlc via:

latexmlc --format=html5 --whatsin=fragment "literal:$\displaystyle\boldsymbol{0}$"  --preload=amsmath

Owner

brucemiller commented Feb 1, 2018

Don't get sidetracked with plane1 or even fancy math; you'll get the same problem with \pi

Collaborator

dginev commented Feb 1, 2018

Actually, now that I try plain text, all unicode is coming out double-encoded to STDOUT. Huh.

File output still works reliably.

And I verified this is seen both when reading from STDIN or a file.

Collaborator

dginev commented Feb 1, 2018

Ok, so I now retried with the exact example I fixed in #918 since I am starting to go crazy.

I can confirm my PR fixed the exact error reported in that issue, one of:

processing started Fri Feb  2 00:28:23 2018:12: parser error : Input is not proper UTF-8, indicate encoding !

And can also confirm that the argument-free calls of latexml test.tex and latexmlc test.tex print correct, valid unicode on STDOUT.

However, I think post-processing is somehow not as robust as the core processing for the encoding print to stdout. Modifying the command to latexmlc --format=html5 test.tex leads to the unicode getting double-encoded. So this gives me something to diagnose with.

Pretty certain it is not executable-specific as it can be seen in both latexmlc and latexmlmath, and now it seems clear it relates to the HTML5 post-processing toolchain

Collaborator

dginev commented Feb 1, 2018

Wow, so all of the cases that looked correct on STDOUT were indeed printed not by libxml's serializer but by your serialize_aux in Document.pm !

Ok...

Then maybe there is a solution where we never use binmode at all, but instead add an explicit encode to UTF-8 to the result of the serialize_aux call? Quite the ball of yarn to untangle

Owner

brucemiller commented Feb 2, 2018

Actually, not so hard, I think.

For the record, here is an old but complete, if painful, summary of issues with Unicode (not even necessarily restricted to Perl): https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default. The relevant takeaway is that all communication with the outside world is via bytes, not characters.

Most likely the confusing thing here is that there are 3 distinct "Document" classes in play (LaTeXML::Core, LaTeXML::Post & XML::LibXML), each with a toString method. Only the last one's toString "helpfully" does the encoding for you; its serialization is bytes, not characters.

Note that we pretty consistently only use STDOUT (and files we've opened ourselves) for outputting "real" data, and that is done from relatively few places. So I think the key is that whenever we output real data, we want to be conscious of, and explicit about, whether it is in character form or already encoded bytes; if the former, we encode. In fact, I'd rather not even risk the "sticky" binmode encoding layer, and just explicitly encode the serialization at the point where it's output, if & when needed.

[Aside: STDERR is for less critical, informative, messages. And since in the program generally, we want to deal with characters, not bytes, it's reasonable and convenient to use binmode to set STDERR to utf8]

So, that's what I've implemented and it seems to work. Please verify. FWIW: your 1st commit for #918 was fine, but the 2nd commit went too far.
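As an illustration of the policy described in this comment (encode explicitly at the point of output, and only when the data is still in character form), here is a minimal sketch. It is not the actual commit: the helper `output_bytes` and its `$is_chars` flag are hypothetical names invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not LaTeXML code. $is_chars says whether $data
# is a character string (must be encoded) or already-encoded bytes
# (for example, the result of XML::LibXML's toString).
sub output_bytes {
    my ($data, $is_chars) = @_;
    return $is_chars ? encode('UTF-8', $data) : $data;
}

# The same element, once as characters and once as UTF-8 bytes.
my $from_chars = output_bytes("<mi>\x{3C0}</mi>", 1);
my $from_bytes = output_bytes("<mi>\xCF\x80</mi>", 0);

# No binmode layer on STDOUT: what is printed is exactly these bytes.
print STDOUT ($from_chars eq $from_bytes ? "same bytes\n" : "different bytes\n");
```

Keeping the decision at the call site avoids a stream-wide encoding layer, so already-encoded data passes through untouched.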

Owner

brucemiller commented Feb 2, 2018

Oh, @dginev: I wanted to point out that there's one case of binmode on STDOUT in latexmlc that I wasn't sure about. I didn't want to track down all the possibilities that $result might be :> The binmode is conditional on $is_archive, but maybe a more explicit $needs_encode might be appropriate?

Collaborator

dginev commented Feb 3, 2018

Cool, this works with the examples, and indeed latexmlc needs a fix for stdout. Making a PR.

Owner

brucemiller commented Feb 3, 2018

OK, hopefully the encoding is dealt with correctly now. Thanks all;
