
Fixed bug RF#29521: HTML math output not always XHTML compatible

The characters < and & are not allowed in a script tag in XHTML.
So since the HTML converter uses script tags for math elements,
whenever these characters appear in a value the value is wrapped
in a CDATA section to make the output XHTML compatible.
commit 512b00a6d050f506c43b824861f7bbf459862a19 1 parent 8f11194
Thomas Leitner authored February 19, 2012
3  lib/kramdown/converter/html.rb
@@ -315,7 +315,8 @@ def convert_smart_quote(el, indent)
 
       def convert_math(el, indent)
         block = (el.options[:category] == :block)
-        "<script type=\"math/tex#{block ? '; mode=display' : ''}\">#{el.value}</script>#{block ? "\n" : ''}"
+        value = (el.value =~ /<|&/ ? "<![CDATA[#{el.value}]]>" : el.value)
+        "<script type=\"math/tex#{block ? '; mode=display' : ''}\">#{value}</script>#{block ? "\n" : ''}"
       end
 
       def convert_abbreviation(el, indent)
2  lib/kramdown/parser/html.rb
@@ -529,7 +529,7 @@ def is_math_tag?(el)
 
         def handle_math_tag(el)
           set_basics(el, :math, :category => (el.attr['type'] =~ /mode=display/ ? :block : :span))
-          el.value = el.children.shift.value
+          el.value = el.children.shift.value.sub(/\A<!\[CDATA\[(.*)\]\]>\z/m, '\1')
           el.attr.delete('type')
         end
 
4  test/testcases/block/15_math/normal.html
@@ -6,10 +6,10 @@
 <p><script type="math/tex">\lambda_\alpha > 5</script>
 This is a para.</p>
 
-<script type="math/tex; mode=display">\begin{align*}
+<script type="math/tex; mode=display"><![CDATA[\begin{align*}
 &=5 \\
 &=6 \\
-\end{align*}</script>
+\end{align*}]]></script>
 
 <script type="math/tex; mode=display">5+5</script>
 
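The two changed lines form a round trip: the converter wraps a value in CDATA only when it contains < or &, and the parser strips one enclosing CDATA section again. A minimal sketch of that behaviour outside kramdown, using the same regular expressions as the diff; the helper names `wrap_math_value` and `unwrap_math_value` are hypothetical and do not exist in kramdown:

```ruby
# Hypothetical stand-alone helpers mirroring the two changed lines above.

# Converter side: wrap only when the value contains < or &.
def wrap_math_value(value)
  value =~ /<|&/ ? "<![CDATA[#{value}]]>" : value
end

# Parser side: strip one enclosing CDATA section, if present.
# The /m flag lets (.*) span the line breaks of multi-line math.
def unwrap_math_value(text)
  text.sub(/\A<!\[CDATA\[(.*)\]\]>\z/m, '\1')
end

align = "\\begin{align*}\n&=5 \\\\\n&=6 \\\\\n\\end{align*}"
wrapped = wrap_math_value(align)

puts wrapped                              # CDATA-wrapped, because of the & characters
puts unwrap_math_value(wrapped) == align  # the round trip restores the original value
puts wrap_math_value("\\lambda_\\alpha > 5") # left alone: > does not trigger wrapping
```

This matches the test case above, where only the align* block gains a CDATA section.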

0 notes on commit 512b00a

Xi Wang

With this commit, MathJax displays <![CDATA and ]]> around equations, which is incorrect.

To make math output MathJax compatible, I am using the following workaround.

-        "<script type=\"math/tex#{block ? '; mode=display' : ''}\">#{value}</script>#{block ? "\n" : ''}"
+        lb = block ? '\[' : '\('
+        rb = block ? '\]'"\n" : '\)'
+        "#{lb}#{value}#{rb}"
Gioele

I originally reported the problem with unescaped < and & characters. I also think it would be better to just use \(...\) and \[...\] instead of <script>. Either way, both solutions require either a <![CDATA[ section (which MathJax does not like, you say) or, what I prefer, HTML-escaping the contents of value.

Why isn't HTML escaping used instead of <![CDATA[?

Please note that not HTML-escaping external data is going to lead to security problems.

Xi Wang

CDATA in \[...\] works for MathJax, but not in the <script> tag.

&amp; in <script> (without CDATA) doesn't work for MathJax, either.

Maybe it's better to get rid of the <script> tag?

BTW, any suggestion on how to do HTML escaping here?

Gioele

I'm all for getting rid of the <script> tag, also because it (probably, I haven't tested thoroughly) hides the content from browsers with JS disabled. I would like to know the reason behind that preference of the MathJax authors.

HTML escape can be easily performed with CGI.escapeHTML(value) (CGI is part of the stdlib).
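As a rough illustration of that suggestion, the sketch below runs CGI.escapeHTML over a small LaTeX fragment. The fragment itself is made up for the example. Only the HTML-special characters are rewritten; backslashes pass through unchanged.

```ruby
require 'cgi'

# Made-up LaTeX input containing both & and <.
latex = "\\begin{align*}\nx &< y \\\\\n\\end{align*}"

escaped = CGI.escapeHTML(latex)
puts escaped
# & becomes &amp; and < becomes &lt;, while \\ stays as it is.

# Unescaping gives back the original source.
puts CGI.unescapeHTML(escaped) == latex
```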

Xi Wang

Actually I tried this patch.

-        value = (el.value =~ /<|&/ ? "<![CDATA[#{el.value}]]>" : el.value)
-        "<script type=\"math/tex#{block ? '; mode=display' : ''}\">#{value}</script>#{block ? "\n" : ''}"
+        value = CGI.escapeHTML(el.value)
+        lb = block ? '\[' : '\('
+        rb = block ? '\]'"\n" : '\)'
+        "#{lb}#{value}#{rb}"

Then every \\ (linebreak) in my latex source was turned into &#92;, which of course didn't work in MathJax. Did I miss anything?

Xi Wang

Oops, it has nothing to do with CGI.escapeHTML. \\ appears in kramdown's output. Maybe Octopress or Jekyll does something magic (I have been using kramdown as the markdown engine in Octopress).

Xi Wang

Here goes a summary. I am using Octopress with MathJax and kramdown.

Methods that work

  • <script> w/ original latex (i.e., before this commit). The problem is that it is not XHTML compatible when the latex source contains & and <.

  • \[...\] w/ CDATA (i.e., the workaround I am using). Not sure how safe this is. Probably we need to make sure there is no ]]> in the latex source. Any other concerns?

Methods that do not work

  • <script> w/ CDATA. MathJax displays <![CDATA and ]]>.

  • <script> w/ escaped HTML. MathJax displays amp; for &amp;.

  • \[...\] w/ original latex. The linebreak \\ in latex becomes &#92;. It is also not XHTML compatible.

  • \[...\] w/ escaped HTML. The linebreak \\ in latex becomes &#92;.

Looks like the only method that both works for MathJax and remains XHTML compatible is \[...\] w/ CDATA.
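On the concern about ]]> appearing in the LaTeX source: the usual XML idiom is to end the CDATA section in the middle of the terminator and open a new one right away. The sketch below shows that idiom only; the helper name `cdata_section` is hypothetical, nothing like it is part of this commit, and whether MathJax copes with the split sections inside \[...\] has not been tested here.

```ruby
# Hypothetical helper: wrap text in CDATA, splitting any embedded "]]>"
# so that it cannot terminate the section early.
def cdata_section(text)
  "<![CDATA[" + text.gsub("]]>", "]]]]><![CDATA[>") + "]]>"
end

puts cdata_section("x < y")
puts cdata_section("a[b[c]]>d")  # the embedded ]]> is spread over two sections
```

An XML parser concatenates the adjacent sections, so the consumer sees the original text again.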

Thomas Leitner

One reason for using the <script> tag is so that converting back from HTML to kramdown works. Using \[ and \] would be much more complicated...

Another way to solve this is by using the following startup hook for MathJax:

MathJax.Hub.Register.StartupHook("TeX Jax Ready",function () {
  MathJax.InputJax.TeX.prefilterHooks.Add(function (data) {
    data.math = data.math.replace(/^\s*<!\[CDATA\[\s*((?:\n|.)*)\s*\]\]>\s*$/m,"$1");
  });
});
Gioele

@xiw, can you file a bug with Octopress/Jekyll? They should take care of this \\ thing.

I filed the original bug because I would like to use XHTML5 and XML tools, so I need kramdown to generate valid XHTML5 (both well formed and semantically correct, so, no CDATA inside <script>).

Also, can we have an option to choose which delimiters to use for displayed/inline math? For example some may prefer $..$ (mathexchange-like) or $$..$$ (and, maybe, some <span class='math'> to make the conversion back to kramdown easier), others may be fine with <script> + CDATA.

Xi Wang

@gettalong I like your workaround. I'll go for that. Thanks a lot!

@gioele Not sure if it's a Jekyll or Octopress thing.

Adding some option sounds nice. Since I trust the latex source I wrote, using $..$ and $$..$$ would work for me.

Thomas Leitner

@gioele, @xiw: I think I found a one-size-fits-all solution for this problem. Like with CDATA sections in javascript code, we can just comment out the CDATA code for the script itself, i.e. by using LaTeX comments we can hide the CDATA code from MathJax.

Generated MathJax <script> elements would look like this:

<script type="math/tex">% <![CDATA[
\begin{align*}
< &=5 \\
&=6 \\
\end{align*} %]]> </script>

Could you tell me if this would work for both of you?

Xi Wang

Nice trick! Works for me.

BTW, we need a linebreak after <![CDATA[ for inline math as well, right?

<p>How about <script type="math/tex">%<![CDATA[
x < y
%]]></script>?</p>
Thomas Leitner

Yes, because otherwise the LaTeX code would end up in the comment, too.
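A minimal sketch of how the converter could emit this commented-out form, assuming the same < or & trigger as in the commit above. The method name `math_script` is hypothetical, and this is not necessarily the code that ended up in kramdown:

```ruby
# Hypothetical helper producing the "% <![CDATA[ ... %]]>" form discussed above.
def math_script(value, block)
  if value =~ /<|&/
    # The line break after the opening marker keeps the LaTeX code out of
    # the comment; the closing marker sits behind its own % comment.
    value = "% <![CDATA[\n#{value} %]]>"
  end
  type = "math/tex#{block ? '; mode=display' : ''}"
  "<script type=\"#{type}\">#{value}</script>#{block ? "\n" : ''}"
end

puts math_script("x < y", false)  # inline math, wrapped
puts math_script("5+5", true)     # display math, left unwrapped
```

For MathJax the two marker lines are LaTeX comments, while an XML parser sees an ordinary CDATA section around the math.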
