script elements aren't always terminated by the </script> sequence #114

Closed
abotalov opened this issue Jun 10, 2015 · 4 comments

@abotalov
Contributor

Section 4.12.1.2 of http://www.w3.org/TR/html51/semantics.html contains interesting (aka weird) rules for closing script elements. In particular, take a look at the following code example from that section:

<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>

The states the W3C document uses to handle this situation are, however, quite complex. The spec lists 18 tokenizer states related to parsing the contents of a script tag:
8.2.4.6 Script data state
8.2.4.17 Script data less-than sign state
8.2.4.18 Script data end tag open state
8.2.4.19 Script data end tag name state
8.2.4.20 Script data escape start state
8.2.4.21 Script data escape start dash state
8.2.4.22 Script data escaped state
8.2.4.23 Script data escaped dash state
8.2.4.24 Script data escaped dash dash state
8.2.4.25 Script data escaped less-than sign state
8.2.4.26 Script data escaped end tag open state
8.2.4.27 Script data escaped end tag name state
8.2.4.28 Script data double escape start state
8.2.4.29 Script data double escaped state
8.2.4.30 Script data double escaped dash state
8.2.4.31 Script data double escaped dash dash state
8.2.4.32 Script data double escaped less-than sign state
8.2.4.33 Script data double escape end state

Not sure if Oga should support it or just say that it doesn't support such cases.
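
For illustration, here is a minimal sketch of how those states could be collapsed into three modes (plain script data, escaped, double escaped). It is an assumption-laden simplification, not the spec's algorithm verbatim and not code from Oga or Nokogiri: `script_end_offset` is a hypothetical helper name, and edge cases such as EOF handling and NUL characters are ignored.

```ruby
require 'strscan'

# Hypothetical helper (not part of Oga or Nokogiri). Given the text that
# follows an opening <script> tag, returns the character offset of the
# "</script" that actually closes the element, or nil if there is none.
def script_end_offset(body)
  # A tag name must be followed by whitespace, "/" or ">".
  end_tag     = %r{</script[\t\n\f\r />]}i
  open_tag    = %r{<script[\t\n\f\r />]}i
  comment_end = /-{2,}>/

  scanner = StringScanner.new(body)
  mode    = :data

  until scanner.eos?
    case mode
    when :data
      if scanner.scan(/<!--/)
        # "<!--" starts an escaped section unless it is closed at once,
        # as in "<!-->".
        mode = scanner.scan(/-*>/) ? :data : :escaped
      elsif scanner.check(end_tag)
        return scanner.charpos
      else
        scanner.getch
      end
    when :escaped
      if scanner.scan(comment_end)
        mode = :data
      elsif scanner.check(end_tag)
        return scanner.charpos
      elsif scanner.scan(open_tag)
        # A "<script" inside "<!--" must be balanced by a "</script".
        mode = :double_escaped
      else
        scanner.getch
      end
    when :double_escaped
      if scanner.scan(comment_end)
        mode = :data
      elsif scanner.scan(end_tag)
        # This "</script" only balances the nested "<script".
        mode = :escaped
      else
        scanner.getch
      end
    end
  end

  nil
end

# The text following the first <script> tag in the example above.
body = <<-'HTML'
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>
HTML

offset = script_end_offset(body)
puts body[0...offset]
```

With this input the sketch skips the first `</script>` (it only balances the `<script>` inside the string literal) and stops at the last one, which is the behaviour the spec example describes.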

@yorickpeterse
Owner

What exactly is the supposed behaviour here? Both Nokogiri and Oga seem to parse this exactly the same:

$ cat test.rb
require 'oga'
require 'nokogiri'

html = <<-EOF
<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>
EOF

oga_doc  = Oga.parse_html(html)
noko_doc = Nokogiri::HTML.fragment(html)

puts <<-EOF
Oga:

#{oga_doc.to_xml}

Nokogiri:

#{noko_doc.to_html}
EOF
$ ruby test.rb
Oga:

<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>


Nokogiri:

<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>

I don't see how <!-- despite appearances, this is actually part of the script still! --> is supposed to be part of the first script tag. The first script tag is terminated correctly by the </script>; the occurrence of <!-- <script> has no influence, as anything except </script> is valid in a <script> tag.

Chrome and Firefox seem to disagree, but I'm not going to change Oga's lexer rules just to match this behaviour, as it's much easier to just say "Everything except </script> is allowed".
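
As a rough illustration of the "Everything except </script> is allowed" rule, the sketch below takes everything up to the first closing tag. The helper name `naive_script_text` is hypothetical and the regular expression is only an approximation of the rule as described here; it is not Oga's actual lexer code.

```ruby
# Hypothetical helper illustrating the rule; not Oga's real lexer.
# Returns the contents of the first <script> element, ending at the
# first "</script>" that follows the opening tag.
def naive_script_text(html)
  match = html.match(%r{<script[^>]*>(.*?)</script\s*>}mi)
  match ? match[1] : nil
end

html = <<-'HTML'
<script>
  var example = 'Consider this string: <!-- <script>';
  console.log(example);
</script>
<!-- despite appearances, this is actually part of the script still! -->
<script>
 ... // this is the same script block still...
</script>
HTML

puts naive_script_text(html)
```

Under this rule the first element ends right after `console.log(example);`, so the comment and the second `<script>` are treated as separate nodes.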

@abotalov
Contributor Author

Actually, Nokogiri parses it the way the spec describes. The whole snippet ends up as the text of a single script element:

Nokogiri::HTML.fragment(File.read('file.html'))
 => #<Nokogiri::HTML::DocumentFragment:0x822 name="#document-fragment" children=[#<Nokogiri::XML::Element:0x820 name="script" children=[#<Nokogiri::XML::Text:0x81e "\n  var example = 'Consider this string: <!-- <script>';\n  console.log(example);\n</script>\n<!-- despite appearances, this is actually part of the script still! -->\n<script>\n ... // this is the same script block still...\n">]>]>

@yorickpeterse
Owner

Looking at the spec, I'm not really convinced it makes sense to alter Oga's behaviour. In fact, I'd argue that unless somebody knows the HTML spec by heart, they'd actually expect Oga's behaviour, not what the spec states or what Nokogiri does.

I'm also not really a fan of altering Oga to match badly explained legacy behaviour, e.g. as per this paragraph from the spec:

What is going on here is that for legacy reasons, "<!--" and "<script" strings in script elements in HTML need to be balanced in order for the parser to consider closing the block.

@yorickpeterse
Owner

Having thought about this, I'm going to leave things as they are for the time being. If this is deemed important enough in the future, I'll look into it again.
