script elements aren't always terminated by the </script> sequence #114

abotalov · 2015-06-10T21:32:21Z

http://www.w3.org/TR/html51/semantics.html item 4.12.1.2 contains interesting (aka weird) rules for closing script elements. In particular, take a look at the following code example from that section:

<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>

The states the W3C document uses to handle this situation are, however, quite complex. The spec lists 18 states related to parsing the contents of a script element:
8.2.4.6 Script data state
8.2.4.17 Script data less-than sign state
8.2.4.18 Script data end tag open state
8.2.4.19 Script data end tag name state
8.2.4.20 Script data escape start state
8.2.4.21 Script data escape start dash state
8.2.4.22 Script data escaped state
8.2.4.23 Script data escaped dash state
8.2.4.24 Script data escaped dash dash state
8.2.4.25 Script data escaped less-than sign state
8.2.4.26 Script data escaped end tag open state
8.2.4.27 Script data escaped end tag name state
8.2.4.28 Script data double escape start state
8.2.4.29 Script data double escaped state
8.2.4.30 Script data double escaped dash state
8.2.4.31 Script data double escaped dash dash state
8.2.4.32 Script data double escaped less-than sign state
8.2.4.33 Script data double escape end state

Not sure if Oga should support it or just say that it doesn't support such cases.
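For anyone trying to follow those states, they can roughly be collapsed into three modes: plain script data, "escaped" (after <!--) and "double escaped" (after <!-- followed by <script). Below is a minimal sketch of that reduction using Ruby's stdlib StringScanner. It is not Oga's or Nokogiri's code; script_end_offset, END_TAG and START_TAG are made-up names for illustration, and the input is assumed to be the text that follows the opening <script> tag.

```ruby
require 'strscan'

# Hypothetical names, for illustration only.
# A tag name only counts when followed by whitespace, "/" or ">".
END_TAG   = /<\/script[\t\n\f \/>]/i
START_TAG = /<script[\t\n\f \/>]/i

# Returns the character offset of the "</script" that closes the element,
# or nil when the script is never closed. `text` is assumed to start right
# after the opening <script> tag.
def script_end_offset(text)
  scanner = StringScanner.new(text)
  mode    = :data

  until scanner.eos?
    if mode == :data && scanner.scan(/<!(?=--)/)
      # Only "<!" is consumed: the two dashes may still be part of a "-->",
      # so "<!-->" drops straight back to plain script data.
      mode = :escaped
    elsif mode != :data && scanner.scan(/-->/)
      mode = :data
    elsif mode == :escaped && scanner.scan(START_TAG)
      mode = :double_escaped
    elsif mode == :double_escaped && scanner.scan(END_TAG)
      # This end tag only balances the nested "<script", it closes nothing.
      mode = :escaped
    elsif scanner.match?(END_TAG)
      return scanner.charpos
    else
      scanner.getch
    end
  end

  nil
end

body = "var s = '<!-- <script>';\n</script>\n<!-- still script -->\n</script>"
offset = script_end_offset(body)
puts body[0...offset]
```

With the example from the spec, this sketch should skip the first </script> (it only balances the <script inside the string) and stop at the last one, which is the behaviour the spec describes.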

yorickpeterse · 2015-06-11T04:05:05Z

What exactly is the supposed behaviour here? Both Nokogiri and Oga seem to parse this exactly the same:

$ cat test.rb
require 'oga'
require 'nokogiri'

html = <<-EOF
<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>
EOF

oga_doc  = Oga.parse_html(html)
noko_doc = Nokogiri::HTML.fragment(html)

puts <<-EOF
Oga:

#{oga_doc.to_xml}

Nokogiri:

#{noko_doc.to_html}
EOF
$ ruby test.rb
Oga:

<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>


Nokogiri:

<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>

I don't see how the markup after the first </script> is supposed to be part of the first script tag. The first script tag is terminated correctly by the </script>; the occurrence of <!-- <script> has no influence, as anything except </script> is valid in a <script> tag.

Chrome and Firefox seem to disagree, but I'm not going to change Oga's lexer rules just to match this behaviour, as it's much easier to just say "Everything except </script> is allowed".
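To make the "everything except </script> is allowed" rule concrete, here is a minimal sketch of it in plain Ruby. This is not Oga's actual lexer; naive_script_end is a hypothetical helper, and the input is assumed to be the text following the opening <script> tag.

```ruby
# Hypothetical helper, for illustration only: the script body ends at the
# first "</script>" (case-insensitive), no matter what precedes it.
# Returns the character offset of that end tag, or nil if there is none.
def naive_script_end(text)
  text.index(/<\/script\s*>/i)
end

body = "var s = '<!-- <script>';\n</script>\n<!-- more -->\n</script>"
offset = naive_script_end(body)
puts body[0...offset]
```

Under this rule the <!-- and <script inside the string literal play no role, so the script body stops at the first closing tag.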

abotalov · 2015-06-11T05:39:03Z

Actually, Nokogiri parses it as described in the spec:

Nokogiri::HTML.fragment(File.read('file.html'))
 => #<Nokogiri::HTML::DocumentFragment:0x822 name="#document-fragment" children=[#<Nokogiri::XML::Element:0x820 name="script" children=[#<Nokogiri::XML::Text:0x81e "\n  var example = 'Consider this string: <!-- <script>';\n  console.log(example);\n</script>\n<!-- despite appearances, this is actually part of the script still! -->\n<script>\n ... // this is the same script block still...\n">]>]>

yorickpeterse · 2015-06-11T05:56:08Z

Looking at the spec I'm not really convinced it makes sense altering Oga's behaviour. In fact, I'd argue that unless somebody knows the HTML spec by heart they'd actually expect Oga's behaviour, not what the spec/Nokogiri state/do.

I'm also not really a fan of altering Oga to match badly explained legacy behaviour, e.g. as per this paragraph:

What is going on here is that for legacy reasons, "<!--" and "<script" strings in script elements in HTML need to be balanced in order for the parser to consider closing the block.

yorickpeterse · 2015-06-16T20:21:46Z

Having thought about this I'm going to leave things as is for the time being. If this is deemed important enough in the future I'll look into it again.

abotalov changed the title ~~script tags aren't always get terminated by </script> sequence~~ script elements aren't always get terminated by </script> sequence Jun 10, 2015

yorickpeterse added the HTML label Jun 14, 2015

yorickpeterse closed this as completed Jun 16, 2015
