script elements aren't always terminated by the </script> sequence #114

Closed
abotalov opened this Issue Jun 10, 2015 · 4 comments

@abotalov
Contributor

abotalov commented Jun 10, 2015

http://www.w3.org/TR/html51/semantics.html item 4.12.1.2 contains interesting (aka weird) rules for closing script elements. In particular, take a look at the following code example from that section:

<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>

The states the W3C document uses to handle this situation are, however, quite complex. The spec lists 18 states related to parsing script element contents:
8.2.4.6 Script data state
8.2.4.17 Script data less-than sign state
8.2.4.18 Script data end tag open state
8.2.4.19 Script data end tag name state
8.2.4.20 Script data escape start state
8.2.4.21 Script data escape start dash state
8.2.4.22 Script data escaped state
8.2.4.23 Script data escaped dash state
8.2.4.24 Script data escaped dash dash state
8.2.4.25 Script data escaped less-than sign state
8.2.4.26 Script data escaped end tag open state
8.2.4.27 Script data escaped end tag name state
8.2.4.28 Script data double escape start state
8.2.4.29 Script data double escaped state
8.2.4.30 Script data double escaped dash state
8.2.4.31 Script data double escaped dash dash state
8.2.4.32 Script data double escaped less-than sign state
8.2.4.33 Script data double escape end state

Not sure if Oga should support it or just say that it doesn't support such cases.

@abotalov abotalov changed the title from script tags aren't always get terminated by </script> sequence to script elements aren't always get terminated by </script> sequence Jun 10, 2015

@YorickPeterse

Owner

YorickPeterse commented Jun 11, 2015

What exactly is the supposed behaviour here? Both Nokogiri and Oga seem to parse this exactly the same:

$ cat test.rb
require 'oga'
require 'nokogiri'

html = <<-EOF
<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>
EOF

oga_doc  = Oga.parse_html(html)
noko_doc = Nokogiri::HTML.fragment(html)

puts <<-EOF
Oga:

#{oga_doc.to_xml}

Nokogiri:

#{noko_doc.to_html}
EOF
$ ruby test.rb
Oga:

<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>


Nokogiri:

<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>

I don't see how <!-- despite appearances, this is actually part of the script still! --> is supposed to be part of the first script tag. The first script tag is terminated correctly by the </script>, the occurrence of <!-- <script> has no influence as anything except </script> is valid in a <script> tag.

Chrome and Firefox seem to disagree, but I'm not going to change Oga's lexer rules just to match this behaviour, as it's much easier to just say "Everything except </script> is allowed".

@abotalov

Contributor

abotalov commented Jun 11, 2015

Actually, Nokogiri parses it as said in the spec:

Nokogiri::HTML.fragment(File.read('file.html'))
 => #<Nokogiri::HTML::DocumentFragment:0x822 name="#document-fragment" children=[#<Nokogiri::XML::Element:0x820 name="script" children=[#<Nokogiri::XML::Text:0x81e "\n  var example = 'Consider this string: <!-- <script>';\n  console.log(example);\n</script>\n<!-- despite appearances, this is actually part of the script still! -->\n<script>\n ... // this is the same script block still...\n">]>]>
@YorickPeterse

Owner

YorickPeterse commented Jun 11, 2015

Looking at the spec I'm not really convinced it makes sense altering Oga's behaviour. In fact, I'd argue that unless somebody knows the HTML spec by heart they'd actually expect Oga's behaviour, not what the spec/Nokogiri state/do.

I'm also not really a fan of altering Oga to match badly explained legacy behaviour, e.g. as per this paragraph:

What is going on here is that for legacy reasons, "<!--" and "<script" strings in script elements in HTML need to be balanced in order for the parser to consider closing the block.
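
To make that balancing rule concrete, here is a rough Ruby sketch that collapses the spec's 18 script data states into three (plain, escaped, double escaped) and reports where a script body really ends. The names `SCRIPT_TOKEN` and `script_end_offset` are hypothetical and made up for this illustration. This is not how Oga or Nokogiri implement it, and it assumes the input starts right after the opening <script> tag and ignores details such as EOF handling and newline preprocessing.

```ruby
# Hypothetical helper, not part of Oga or Nokogiri.
# Tokens that can change the tokenizer state inside a script body. A tag
# name only counts when followed by whitespace, "/" or ">".
SCRIPT_TOKEN = %r{<!--|-->|</?script(?=[\t\n\f />])}i

# Returns the offset of the "</script" that really closes the element,
# or nil when the body is never closed.
def script_end_offset(input)
  state = :data
  pos   = 0

  while (m = SCRIPT_TOKEN.match(input, pos))
    token = m[0].downcase
    # After "<!--" resume right behind "<!" so that its two dashes can
    # still take part in a following "-->" (this makes "<!-->" work).
    pos = token == '<!--' ? m.begin(0) + 2 : m.end(0)

    case state
    when :data
      return m.begin(0) if token == '</script'
      state = :escaped if token == '<!--'
    when :escaped
      case token
      when '</script' then return m.begin(0)
      when '-->'      then state = :data
      when '<script'  then state = :double_escaped
      end
    when :double_escaped
      case token
      when '-->'      then state = :data
      when '</script' then state = :escaped # does NOT close the element
      end
    end
  end

  nil
end

# The example from this issue, minus the opening <script> tag.
body = <<~'JS'
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
  </script>
  <!-- despite appearances, this is actually part of the script still! -->
  <script>
   ... // this is the same script block still...
  </script>
JS

offset = script_end_offset(body)
puts body[0...offset]
```

On the example from this issue, the first </script> only moves the scanner from double escaped back to escaped, so the reported end is the final </script>, in line with the Nokogiri output quoted above.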

@YorickPeterse YorickPeterse added the HTML label Jun 14, 2015

@YorickPeterse

Owner

YorickPeterse commented Jun 16, 2015

Having thought about this I'm going to leave things as is for the time being. If this is deemed important enough in the future I'll look into it again.
