Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More encoding cleanup #940

Merged
merged 1 commit into from Feb 3, 2018
Merged

Conversation

dginev
Copy link
Collaborator

@dginev dginev commented Feb 3, 2018

Follow-up to #938 .

In fact I think this PR demonstrates why I think leaving things in the current state will cause more trouble (at least for me) down the line when I forget this discussion thread. Having identically named toString methods that return different types of data (unencoded Perl chars vs encoded Unicode bytes) is just awkward.

Then again I realize there are more toString methods out there, and they all use characters internally. So might as well punt with this... I just caught myself again mistakenly double-encoding in LaTeXML.pm (see line 327 in diff), it's just too opaque to easily notice which method is getting called.

@brucemiller
Copy link
Owner

I agree with everything except perhaps the conclusion. I'd argue we want either bytes or characters (virtually) all the time (and currently we're almost always characters).

You seem to argue that toString ought to return bytes consistently. But recall that XML::LibXML has already provided us with 1 toString that returns bytes and many that return characters! Simplest is to isolate that 1 toString method (or not use it?)

@dginev
Copy link
Collaborator Author

dginev commented Feb 3, 2018

Well I would be happy to go that way too, but that's not as simple either, as there is another method that returns bytes after serializing - toStringHTML:
https://github.com/dginev/LaTeXML/blob/ded9f9698c533529f24813ec23a81364e2c435b2/lib/LaTeXML.pm#L324

So it's messy either way... I almost catch myself wishing there was a Serialized class that had a reliable unicode flag, or types, or anything consistently enforcing really. Maybe merging here is as good as we get for now.

@brucemiller
Copy link
Owner

Yeah, OK, so two methods --- actually, I was bunding that with the XML::LibXML::Document->toString method --- and in fact, they're nicely encapsulated in LaTeXML::Post::Writer. Perhaps that's a good model for it?

@brucemiller brucemiller merged commit d8afb80 into brucemiller:master Feb 3, 2018
@dginev
Copy link
Collaborator Author

dginev commented Feb 3, 2018

Writer is nice for the classic local uses, but for LaTeXML.pm it is not possible to use it -- for example in the web services / daemon server uses, where the serialized result is passed back to some middleware which passes it over the wire to the final recipient.

Not seeing any easy wins... If there was a reliable generic ensure_unicode_encoding routine which operated on strings, that would be something. I have had the pleasure of working with such a (home grown) guard in Ruby, which was used a lot in practice. And would be something you could run once before returning the serialization, as a convention, which makes it pretty. But I'm pretty sure this is harder in Perl 5.

@dginev dginev deleted the comments-re-encoding branch February 25, 2018 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants