Switch to generating HTML ourselves #99

bhollis · 2013-08-19T07:34:38Z

Switch to generating HTML ourselves, without the help of any XML/HTML libraries. This gives us much better direct control of HTML output, makes it much easier to provide consistency across Rubies, removes exposure to Nokogiri bugs, and furthers the goal of removing Nokogiri as a required dependency of Maruku.

@distler, I'm especially interested in how this works in your application, since you seem to really put Maruku through its paces and can probably find some problems that aren't yet in tests.

bhollis · 2013-08-21T05:58:02Z

Rebased.

distler · 2013-08-21T06:01:46Z

This will require a lot more checking ... :-)

… library.

…dules.

The blahtex support code has been simplified to no longer cache mathml in an unbounded file-based cache, and it no longer writes tempfiles. The extremely simple XML parsing job has been handed back over to REXML.

bhollis · 2013-08-25T05:21:18Z

I added in some fixes to remove the itex2mml and blahtex Nokogiri dependencies. Blahtex looked like it was pretty broken, so I put in some tests and go it working again.

bhollis · 2013-08-25T05:27:00Z

And now S5 is cleaned up too.

distler · 2013-08-25T05:54:17Z

Your "blahtex" output looks like MathML to me.

Maruku 0.6.1 would notice if a single header encompassed the entire document, and would make that the root section of the document, as well as setting the title to that header's content. This restores that functionality.

This change switches to using the HTML output utilities to generate S5 slides. It also restores the STDOUT printing of slide titles that current Maruku does, but only from the CLI (it won't annoy users who programmatically generate slides).

bhollis · 2013-08-25T06:16:12Z

Thanks, @distler, I'll endeavour to correct my mistakes.

Speaking of which, the TOC generation change affected output, so I rewrote that commit to include modified tests.

bhollis · 2013-08-25T06:16:52Z

Also, I have no experience with blahtex - I thought the point was for it to output MathML when you use the --mathml flag?

distler · 2013-08-25T06:40:11Z

The library was designed to do either (output mathml or png's), at the flip of a switch. (It was originally written as a proposed replacement for texvc in MediaWiki. The idea was that Wikipedia could continue generating png's for math, as it currently does and then, at some point in the future, flip a switch and start generating MathML instead. Blahtex never found favour with the MediaWiki people, so development sorta languished.)

In Maruku, Blahtex has mostly been used to output png's (for those who don't want/can't handle MathML).

The PNG output is a little hard to test properly

It requires a working TeX installation to process the input.
The output involves a bunch of pictures of the equations we're trying to generate.

But that's what's actually used, in practice, so it's what we want to test.

bhollis · 2013-08-25T07:00:36Z

OK, sure, it's definitely worth testing the PNG output. My changes simply fix the MathML output (when :html_math_output_png has not been turned on).

In other news, this viewBox thing is turning out to be somewhat difficult. Something is weird where Nokogiri::XML::DocumentFragment will lowercase all attributes when parsing MathML that includes SVG. Nokogiri::XML::Document.parse is fine, and parsing just SVG through DocumentFragment is fine. It smells like this bug: sparklemotion/nokogiri#455 - I wonder if "HTML mode" is getting triggered somehow.

distler · 2013-08-25T07:08:28Z

Something is weird where Nokogiri::XML::DocumentFragmentwill lowercase all attributes when parsing MathML that includes SVG.

I assume this is a problem with the JRuby version? I have never seen that under MRI.

bhollis · 2013-08-25T07:42:08Z

Nope, this is 1.6.0 under Ruby 2.0.0. Check it out: sparklemotion/nokogiri#961

…is actually broken.

bhollis · 2013-08-25T08:05:21Z

I think I'm too tired to keep hammering on this tonight. I've gotten it so it either preserves case on that attribute, or preserves math entities in the document, but not both. Even when I switch HTMLFragment to use the same exact code for parsing the document that the itex2mml module used to. Maybe this'll make sense later.

bhollis · 2013-08-25T08:30:55Z

Hmm, now I see the problem - I confused to_utf8 from itex_stringsupport with as_utf8. I'm not exactly sure I understand what to_utf8 is doing though - what library adds convert_to_utf8 to String? That seems to be the key to preventing Nokogiri's freakout.

bhollis · 2013-08-25T08:53:32Z

That's not even entirely it - adding to_utf8 fixes my reduced repro case, but not the full test.

distler · 2013-08-25T14:35:08Z

#to_utf8 converts MathML named entitites to the corresponding utf8 characters.
Nokogiri (and any other legitimate XML parser) would freak out if presented with named entities without the DTD which defines them.

Its partner is #to_ncr which converts named entities to numeric character references.

bhollis · 2013-08-25T18:37:56Z

That makes sense. Do you know where convert_to_utf8 is defined? I'd like to have that code available within Maruku without relying on itextomml.

distler · 2013-08-25T18:42:13Z

Do you know where convert_to_utf8 is defined?

Sure (since I wrote it). It's in lib/itex_stringsupport.rb in the gem. Or you can go to the source code repository to find it.

…cludes a doctype. This preserves both the case of attributes and any named entity references from MathML.

This also extends the tests for issue #40 and squashes a bunch of bugs around CDATA, mostly using hacks.

bhollis · 2013-08-31T06:53:22Z

OK, I think I have this thing licked.

After switching to parsing fragments as a full document with a DOCTYPE, we preserve named entites and the case of attributes. itex_stringsupport isn't even needed, but I'm leaving it in because without it, math prints as named entities rather than as Unicode characters.

What do you think?

distler · 2013-08-31T07:03:06Z

Maruku's output should definitely use either unicode characters or NCRs (rather than named entities).

distler · 2013-08-31T07:28:36Z

Also, I'm surprised that the presence or absence of a DOCTYPE declaration makes a difference.

bhollis · 2013-08-31T20:36:06Z

Yeah, it's a pretty weird way to fix it. Let me know what it looks like running on your documents and I'll merge.

bhollis · 2013-09-02T02:37:17Z

No hurry, but let me know when this is good to merge.

bhollis · 2013-09-07T04:46:00Z

Manually merged.

coveralls · 2014-11-11T12:07:54Z

Changes Unknown when pulling 5ce7c41 on string-html into * on master*.

bhollis added 4 commits August 24, 2013 15:52

Switch to generating HTML ourselves, without the help of any XML/HTML…

950b018

… library.

Changes required to support removing nokogiri dependency from math mo…

e939216

…dules.

Rework blahtex support to no longer use Nokogiri or PStore cache.

a0ee775

The blahtex support code has been simplified to no longer cache mathml in an unbounded file-based cache, and it no longer writes tempfiles. The extremely simple XML parsing job has been handed back over to REXML.

Remove explicit Nokogiri dependency for itex2mml module.

e258dcb

bhollis added 2 commits August 24, 2013 23:13

Fix TOC generation to match older Maruku.

773109c

Maruku 0.6.1 would notice if a single header encompassed the entire document, and would make that the root section of the document, as well as setting the title to that header's content. This restores that functionality.

Return case-sensitivity to these math/SVG tests to reflect that this …

9147ad6

…is actually broken.

bhollis added 2 commits August 30, 2013 23:50

Switch to parsing XML/HTML fragments using a wrapper document that in…

ebc1a8a

…cludes a doctype. This preserves both the case of attributes and any named entity references from MathML.

Fix CDATA shenanigans that appeared after the previous commit.

2348546

This also extends the tests for issue #40 and squashes a bunch of bugs around CDATA, mostly using hacks.

Remove TODO

365a989

Return to obfuscating emails with character entities.

5ce7c41

bhollis closed this Sep 7, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to generating HTML ourselves #99

Switch to generating HTML ourselves #99

bhollis commented Aug 19, 2013

bhollis commented Aug 21, 2013

distler commented Aug 21, 2013

bhollis commented Aug 25, 2013

bhollis commented Aug 25, 2013

distler commented Aug 25, 2013

bhollis commented Aug 25, 2013

bhollis commented Aug 25, 2013

distler commented Aug 25, 2013

bhollis commented Aug 25, 2013

distler commented Aug 25, 2013

bhollis commented Aug 25, 2013

bhollis commented Aug 25, 2013

bhollis commented Aug 25, 2013

bhollis commented Aug 25, 2013

distler commented Aug 25, 2013

bhollis commented Aug 25, 2013

distler commented Aug 25, 2013

bhollis commented Aug 31, 2013

distler commented Aug 31, 2013

distler commented Aug 31, 2013

bhollis commented Aug 31, 2013

bhollis commented Sep 2, 2013

bhollis commented Sep 7, 2013

coveralls commented Nov 11, 2014

Switch to generating HTML ourselves #99

Switch to generating HTML ourselves #99

Conversation

bhollis commented Aug 19, 2013

bhollis commented Aug 21, 2013

distler commented Aug 21, 2013

bhollis commented Aug 25, 2013

bhollis commented Aug 25, 2013

distler commented Aug 25, 2013

bhollis commented Aug 25, 2013

bhollis commented Aug 25, 2013

distler commented Aug 25, 2013

bhollis commented Aug 25, 2013

distler commented Aug 25, 2013

bhollis commented Aug 25, 2013

bhollis commented Aug 25, 2013

bhollis commented Aug 25, 2013

bhollis commented Aug 25, 2013

distler commented Aug 25, 2013

bhollis commented Aug 25, 2013

distler commented Aug 25, 2013

bhollis commented Aug 31, 2013

distler commented Aug 31, 2013

distler commented Aug 31, 2013

bhollis commented Aug 31, 2013

bhollis commented Sep 2, 2013

bhollis commented Sep 7, 2013

coveralls commented Nov 11, 2014