Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two more wrongly encoded cases in LaTeXML.pm #949

Merged
merged 1 commit into from Feb 25, 2018

Conversation

dginev
Copy link
Collaborator

@dginev dginev commented Feb 7, 2018

While investigating #948 I spotted a broken bold 0 encoding when printing to stdout from latexmlc.

Turns out there were two cases that needed new patches

  • One was the fatal error case for core processing, which was missing the encode call
  • The other was the XML format case in post-processing, which was leveraging the duck-typing of all document objects (i.e. all of them having toString) to return differently encoded values -- since exactly one of our toString calls isn't returning bytes. So I added an explicit check there.

Keeps feeling awkward... I still don't care enough to raise a larger stink about refactoring :> If I hit this once again I might.

@dginev
Copy link
Collaborator Author

dginev commented Feb 7, 2018

One way of seeing this may be that every use of the serialize_aux-based toString requires/expects encoded unicode. Which may be an indication that the encode call could be pushed more tightly into the actual method?

@brucemiller
Copy link
Owner

Actually that's not true. I should have caught that in your comments, but serialize_aux is internal to implementing LaTeXML::Core::Document->toString so (virtually) "nobody" uses it directly, but everybody uses toString and they expect characters. (should correct the comments). As I said elsewhere, only XML::LibXML::Document->toString (and toStringHMTL) return bytes and those really should be locked away to prevent confusion

@dginev
Copy link
Collaborator Author

dginev commented Feb 7, 2018

Ah, but I phrased my claim extremely carefully this time.

All toString calls for the final output of the created Document by latexml, be it Core, Post, or some fragment in the --whatsout=fragment route, are currently expected to return bytes. All classic executables, as well as LaTeXML.pm, do a print assuming the payload is already in bytes.

This is the entire realm I was quantifying over when I said every, I was being quite careful.

The internal toString methods are not something i have studied closely recently, but I do recall the main workhorse internally tend to be the ToString and Stringify global functions instead.

Anyway, I think we are saying the same thing from 2 perspectives. Maybe renaming all Document methods away from toString into a serialize or serializeFinal call, that is always bytes, and mandating all toString calls return chars, would be the right refactor? And we refactor/rename away any uses of the toString calls in the executables and Writer, to be sure.

@brucemiller
Copy link
Owner

Oh! I forgot to merge... is this one still needed?

@dginev
Copy link
Collaborator Author

dginev commented Feb 25, 2018

It is still needed, yes.

I thought you left it open so that I rework it so that there is a specially named method that always encodes to Unicode, but if you have decided that is a long-term goal, merging here is still a good choice. Fixes a problem.

@brucemiller
Copy link
Owner

Ah, not really; a nice safe general approach will be good, but I'm not waiting.

@brucemiller brucemiller merged commit 1b8e67b into brucemiller:master Feb 25, 2018
@dginev dginev deleted the stdout-pmml-latexmlc branch February 25, 2018 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants