Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional parsing for scripts #1093

Merged
merged 7 commits into from Dec 3, 2017
Merged

Additional parsing for scripts #1093

merged 7 commits into from Dec 3, 2017

Conversation

eugbaranov
Copy link
Contributor

Whilst investigating original issue I've found a few more cases when parser would mistake special characters for end tag sequence.

| _ -> state.Cons(); scriptDoubleQuoteString state
and scriptSlash state =
match state.Peek() with
| '/' -> state.Cons(); scriptSinleLineComment state
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: sinle instead of single

@@ -96,6 +96,18 @@ let ``Can handle attributes with no value``() =
]
node.Attributes() |> should equal expected

[<TestCase("var r =\"</script>\"")>]
[<TestCase("var r ='</script>'")>]
[<TestCase("var r =/</g")>]
Copy link

@ovatsus ovatsus Oct 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test case with "var r =/\/</g"

[<TestCase("var r =\"</script>\"")>]
[<TestCase("var r ='</script>'")>]
[<TestCase("var r =/</g")>]
[<TestCase("//</script>\n")>]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the opposite case, not having the \n at the end to make sure it fails

[<TestCase("var r =/</g")>]
[<TestCase("//</script>\n")>]
[<TestCase("/*</script>*/")>]
[<TestCase("/*</script>**/")>]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add test cases that use /* */ and span multiple lines?

@ovatsus
Copy link

ovatsus commented Oct 21, 2017

/cc @colinbull

@ovatsus
Copy link

ovatsus commented Oct 24, 2017

@Regent , is this finished now or are you still looking at more corner cases? @colinbull , can you have a quick look as well?

@eugbaranov
Copy link
Contributor Author

@ovatsus, in coming days I also want to cover the case of parsing <script>0;<body></body> correctly - at the moment it would parse as simply <script /> - but ideally I would like to send a separate PR for that.

@ovatsus
Copy link

ovatsus commented Dec 3, 2017

Sorry, I thought I had already merged this before

@ovatsus ovatsus merged commit a645e07 into fsprojects:master Dec 3, 2017
@ovatsus
Copy link

ovatsus commented Dec 3, 2017

Released in 2.4.3

@glchapman
Copy link

Doing some scraping, I found a case where it appears this change (or at least changes since 2.4.2) do not find the end tag of a script. The url in question is :
https://www.marketwatch.com/investing/stock/msft/analystestimates
(where any stock ticker can replace "msft").

I'm attaching captures from the VS Debugger's text visualization of the script node's contents (Raw View). As you can see, in 2.4.2, the script node is correctly terminated, whereas in 2.4.6, it is not terminated until the closing tag of the ensuing script.

Script_2.4.2.txt
Script_2.4.6.txt

@glchapman
Copy link

I believe I've narrowed down my problem to the handling of regexes in scripts. As you can see in the attached example, the regex contains a '/' in a bracketed character class. The tokenizer mistakes this for the end delimiter of the regex, and so fails to tokenize correctly.
test_html.txt
Here's fsi output for 2.4.6 with the above test.html:

> let doc = HtmlDocument.Load("test.html");;
val doc : HtmlDocument =
  <html>
  <head />
  <body>
    <script type="text/javascript">
  function(selector){
    return selector.replace(/([/.])/g, '\\$1');
  };
</script>
<div>captured by script</div>
<script></script>
  </body>
</html>

> let script = HtmlDocument.descendantsNamed false ["script"] doc |> Seq.head;;
val script : HtmlNode =
  <script type="text/javascript">
  function(selector){
    return selector.replace(/([/.])/g, '\\$1');
  };
</script>
<div>captured by script</div>
<script></script>

@ovatsus
Copy link

ovatsus commented Apr 24, 2018

@glchapman , can you send a PR with a failing test case please? And if you could also fix it that would be great, but if you send the failing tests that's already very helpful

@eugbaranov
Copy link
Contributor Author

I haven't realised that you don't need to escape forward slash when it is inside a group. I can't find a definitive specification on the matter - hoped to get it from www.ecma-international.org but they might have forgotten to pay for the hosting...

@glchapman
Copy link

I'll try to send a PR with a failing case in a bit, but FWIW my reading of the current HTML standard implies you don't actually have to parse the script text (to handle regexes, etc). Instead, raw text nodes (the content of scripts and styles) can have any text at all up to the terminating tag:
https://www.w3.org/TR/2017/REC-html52-20171214/syntax.html#restrictions-on-the-contents-of-raw-text-and-escapable-raw-text-elements
See also the description of parsing the RAWTEXT state:
https://www.w3.org/TR/2017/REC-html52-20171214/syntax.html#rawtext-state

I don't know if this differs from earlier HTML variants.

@colinbull
Copy link
Contributor

I'm just looking at this now. @glchapman Can you let me push to your PR with the failing test?

@glchapman
Copy link

@colinbull -- I've added you as a collaborator to my fork with the PR. Is this what you need? (I'm sorry, I'm not a big github user).

@colinbull
Copy link
Contributor

Yep thanks. Ignore my comments before I have caught up on whats going on now.

@colinbull
Copy link
Contributor

So I'm not sure after browsing through this https://tc39.github.io/ecma262/#sec-regexp-regular-expression-objects that the above is a valid regex. If the intermediate '/' is escaped. then the parser works.

For my knowledge is the original motivation for these changes because the closing script ended up being skipped in some cases?

@eugbaranov
Copy link
Contributor Author

Original motivation was to handle some corner-cases, e.g. so that </ within regex doesn't get parsed as start of an end tag.

But that's a good point - do we need to parse scripts at all. If I read it correctly, HTML spec simply looks for a closing script tag.
One good thing about current implementation is that it's resilient to the lack of closing tag.

@colinbull
Copy link
Contributor

Ok, I have the unit tests passing now but have broken the signature tests (ebay_cars.htm). I haven't got anymore time tonight to look but hoping to get back to this, tomorrow or wednesday.

colinbull added a commit to colinbull/FSharp.Data that referenced this pull request Sep 9, 2018
@ovatsus ovatsus mentioned this pull request Sep 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants