Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to generating HTML ourselves #99

Closed
wants to merge 11 commits into from
Closed

Switch to generating HTML ourselves #99

wants to merge 11 commits into from

Conversation

bhollis
Copy link
Owner

@bhollis bhollis commented Aug 19, 2013

Switch to generating HTML ourselves, without the help of any XML/HTML libraries. This gives us much better direct control of HTML output, makes it much easier to provide consistency across Rubies, removes exposure to Nokogiri bugs, and furthers the goal of removing Nokogiri as a required dependency of Maruku.

@distler, I'm especially interested in how this works in your application, since you seem to really put Maruku through its paces and can probably find some problems that aren't yet in tests.

@bhollis
Copy link
Owner Author

bhollis commented Aug 21, 2013

Rebased.

@distler
Copy link
Collaborator

distler commented Aug 21, 2013

This will require a lot more checking ... :-)

The blahtex support code has been simplified to no longer cache mathml in an unbounded file-based cache, and it no longer writes tempfiles. The extremely simple XML parsing job has been handed back over to REXML.
@bhollis
Copy link
Owner Author

bhollis commented Aug 25, 2013

I added in some fixes to remove the itex2mml and blahtex Nokogiri dependencies. Blahtex looked like it was pretty broken, so I put in some tests and go it working again.

@bhollis
Copy link
Owner Author

bhollis commented Aug 25, 2013

And now S5 is cleaned up too.

@distler
Copy link
Collaborator

distler commented Aug 25, 2013

Your "blahtex" output looks like MathML to me.

Maruku 0.6.1 would notice if a single header encompassed the entire document, and would make that the root section of the document, as well as setting the title to that header's content. This restores that functionality.
This change switches to using the HTML output utilities to generate S5 slides. It also restores the STDOUT printing of slide titles
that current Maruku does, but only from the CLI (it won't annoy users who programmatically generate slides).
@bhollis
Copy link
Owner Author

bhollis commented Aug 25, 2013

Thanks, @distler, I'll endeavour to correct my mistakes.

Speaking of which, the TOC generation change affected output, so I rewrote that commit to include modified tests.

@bhollis
Copy link
Owner Author

bhollis commented Aug 25, 2013

Also, I have no experience with blahtex - I thought the point was for it to output MathML when you use the --mathml flag?

@distler
Copy link
Collaborator

distler commented Aug 25, 2013

The library was designed to do either (output mathml or png's), at the flip of a switch. (It was originally written as a proposed replacement for texvc in MediaWiki. The idea was that Wikipedia could continue generating png's for math, as it currently does and then, at some point in the future, flip a switch and start generating MathML instead. Blahtex never found favour with the MediaWiki people, so development sorta languished.)

In Maruku, Blahtex has mostly been used to output png's (for those who don't want/can't handle MathML).

The PNG output is a little hard to test properly

  • It requires a working TeX installation to process the input.
  • The output involves a bunch of pictures of the equations we're trying to generate.

But that's what's actually used, in practice, so it's what we want to test.

@bhollis
Copy link
Owner Author

bhollis commented Aug 25, 2013

OK, sure, it's definitely worth testing the PNG output. My changes simply fix the MathML output (when :html_math_output_png has not been turned on).

In other news, this viewBox thing is turning out to be somewhat difficult. Something is weird where Nokogiri::XML::DocumentFragment will lowercase all attributes when parsing MathML that includes SVG. Nokogiri::XML::Document.parse is fine, and parsing just SVG through DocumentFragment is fine. It smells like this bug: sparklemotion/nokogiri#455 - I wonder if "HTML mode" is getting triggered somehow.

@distler
Copy link
Collaborator

distler commented Aug 25, 2013

Something is weird where Nokogiri::XML::DocumentFragmentwill lowercase all attributes when parsing MathML that includes SVG.

I assume this is a problem with the JRuby version? I have never seen that under MRI.

@bhollis
Copy link
Owner Author

bhollis commented Aug 25, 2013

Nope, this is 1.6.0 under Ruby 2.0.0. Check it out: sparklemotion/nokogiri#961

@bhollis
Copy link
Owner Author

bhollis commented Aug 25, 2013

I think I'm too tired to keep hammering on this tonight. I've gotten it so it either preserves case on that attribute, or preserves math entities in the document, but not both. Even when I switch HTMLFragment to use the same exact code for parsing the document that the itex2mml module used to. Maybe this'll make sense later.

@bhollis
Copy link
Owner Author

bhollis commented Aug 25, 2013

Hmm, now I see the problem - I confused to_utf8 from itex_stringsupport with as_utf8. I'm not exactly sure I understand what to_utf8 is doing though - what library adds convert_to_utf8 to String? That seems to be the key to preventing Nokogiri's freakout.

@bhollis
Copy link
Owner Author

bhollis commented Aug 25, 2013

That's not even entirely it - adding to_utf8 fixes my reduced repro case, but not the full test.

@distler
Copy link
Collaborator

distler commented Aug 25, 2013

#to_utf8 converts MathML named entitites to the corresponding utf8 characters.
Nokogiri (and any other legitimate XML parser) would freak out if presented with named entities without the DTD which defines them.

Its partner is #to_ncr which converts named entities to numeric character references.

@bhollis
Copy link
Owner Author

bhollis commented Aug 25, 2013

That makes sense. Do you know where convert_to_utf8 is defined? I'd like to have that code available within Maruku without relying on itextomml.

@distler
Copy link
Collaborator

distler commented Aug 25, 2013

Do you know where convert_to_utf8 is defined?

Sure (since I wrote it). It's in lib/itex_stringsupport.rb in the gem. Or you can go to the source code repository to find it.

…cludes a doctype.

This preserves both the case of attributes and any named entity references from MathML.
This also extends the tests for issue #40 and squashes a bunch of bugs around CDATA, mostly using hacks.
@bhollis
Copy link
Owner Author

bhollis commented Aug 31, 2013

OK, I think I have this thing licked.

After switching to parsing fragments as a full document with a DOCTYPE, we preserve named entites and the case of attributes. itex_stringsupport isn't even needed, but I'm leaving it in because without it, math prints as named entities rather than as Unicode characters.

What do you think?

@distler
Copy link
Collaborator

distler commented Aug 31, 2013

Maruku's output should definitely use either unicode characters or NCRs (rather than named entities).

@distler
Copy link
Collaborator

distler commented Aug 31, 2013

Also, I'm surprised that the presence or absence of a DOCTYPE declaration makes a difference.

@bhollis
Copy link
Owner Author

bhollis commented Aug 31, 2013

Yeah, it's a pretty weird way to fix it. Let me know what it looks like running on your documents and I'll merge.

@bhollis
Copy link
Owner Author

bhollis commented Sep 2, 2013

No hurry, but let me know when this is good to merge.

@bhollis
Copy link
Owner Author

bhollis commented Sep 7, 2013

Manually merged.

@bhollis bhollis closed this Sep 7, 2013
@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 5ce7c41 on string-html into * on master*.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants