Additional parsing for scripts #1093

eugbaranov · 2017-10-19T20:36:42Z

While investigating the original issue, I found a few more cases where the parser would mistake special characters for an end-tag sequence.

ovatsus · 2017-10-21T16:25:17Z

src/Html/HtmlParser.fs

+            | _ -> state.Cons(); scriptDoubleQuoteString state
+        and scriptSlash state =
+            match state.Peek() with
+            | '/' -> state.Cons(); scriptSinleLineComment state


Typo: sinle instead of single

ovatsus · 2017-10-21T16:28:02Z

tests/FSharp.Data.Tests/HtmlParser.fs

@@ -96,6 +96,18 @@ let ``Can handle attributes with no value``() =
        ]
    node.Attributes() |> should equal expected

+[<TestCase("var r =\"</script>\"")>]
+[<TestCase("var r ='</script>'")>]
+[<TestCase("var r =/</g")>]


Can you add a test case with "var r =/\/</g"?

ovatsus · 2017-10-21T16:29:16Z

tests/FSharp.Data.Tests/HtmlParser.fs

+[<TestCase("var r =\"</script>\"")>]
+[<TestCase("var r ='</script>'")>]
+[<TestCase("var r =/</g")>]
+[<TestCase("//</script>\n")>]


Can you add the opposite case, without the \n at the end, to make sure it fails?

ovatsus · 2017-10-21T16:29:41Z

tests/FSharp.Data.Tests/HtmlParser.fs

+[<TestCase("var r =/</g")>]
+[<TestCase("//</script>\n")>]
+[<TestCase("/*</script>*/")>]
+[<TestCase("/*</script>**/")>]


Can you add test cases that use /* */ and span multiple lines?

ovatsus · 2017-10-21T16:30:16Z

/cc @colinbull


ovatsus · 2017-10-24T21:54:48Z

@Regent, is this finished now, or are you still looking at more corner cases? @colinbull, can you have a quick look as well?

eugbaranov · 2017-10-25T13:57:41Z

@ovatsus, in the coming days I also want to cover the case of parsing <script>0;<body></body> correctly. At the moment it parses as simply <script />. Ideally, I would like to send a separate PR for that.

ovatsus · 2017-12-03T18:24:10Z

Sorry, I thought I had already merged this before

ovatsus · 2017-12-03T19:30:08Z

Released in 2.4.3

glchapman · 2018-04-22T19:38:16Z

While doing some scraping, I found a case where it appears that this change (or at least some change since 2.4.2) causes the parser to miss the end tag of a script. The URL in question is:
https://www.marketwatch.com/investing/stock/msft/analystestimates
(where any stock ticker can replace "msft").

I'm attaching captures from the VS Debugger's text visualization of the script node's contents (Raw View). As you can see, in 2.4.2, the script node is correctly terminated, whereas in 2.4.6, it is not terminated until the closing tag of the ensuing script.

Script_2.4.2.txt
Script_2.4.6.txt

glchapman · 2018-04-23T17:11:48Z

I believe I've narrowed down my problem to the handling of regexes in scripts. As you can see in the attached example, the regex contains a '/' in a bracketed character class. The tokenizer mistakes this for the end delimiter of the regex, and so fails to tokenize correctly.
test_html.txt
Here's fsi output for 2.4.6 with the above test.html:

> let doc = HtmlDocument.Load("test.html");;
val doc : HtmlDocument =
  <html>
  <head />
  <body>
    <script type="text/javascript">
  function(selector){
    return selector.replace(/([/.])/g, '\\$1');
  };
</script>
<div>captured by script</div>
<script></script>
  </body>
</html>

> let script = HtmlDocument.descendantsNamed false ["script"] doc |> Seq.head;;
val script : HtmlNode =
  <script type="text/javascript">
  function(selector){
    return selector.replace(/([/.])/g, '\\$1');
  };
</script>
<div>captured by script</div>
<script></script>
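
For reference, here is a minimal sketch of the scanning rule at issue. It is written in JavaScript for illustration and is not the F# tokenizer in HtmlParser.fs. The helper name `regexLiteralEnd` is hypothetical. It assumes the caller already knows that the '/' at `start` opens a regex literal, and it treats a '/' inside a bracketed character class as an ordinary character, which is what the ECMAScript lexical grammar allows.

```javascript
// Hypothetical helper (not part of FSharp.Data): given the index of the
// opening '/' of a regex literal, return the index of the closing '/',
// or -1 if the literal is not terminated on the same line.
function regexLiteralEnd(src, start) {
  let inClass = false; // true while between '[' and ']'
  for (let i = start + 1; i < src.length; i++) {
    const c = src[i];
    if (c === "\\") { i++; continue; }        // skip the escaped character
    if (c === "\n" || c === "\r") return -1;  // regex literals cannot span lines
    if (inClass) {
      if (c === "]") inClass = false;         // '/' inside a class is ignored
      continue;
    }
    if (c === "[") inClass = true;
    else if (c === "/") return i;             // closing delimiter
  }
  return -1;
}

console.log(regexLiteralEnd("/([/.])/g", 0));            // 7
console.log("a/b.c".replace(/([/.])/g, "\\$1"));         // a\/b\.c
```

The second line shows that a JavaScript engine accepts the regex from the attached example as written, with the '/' unescaped inside the class.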

ovatsus · 2018-04-24T02:25:47Z

@glchapman, can you send a PR with a failing test case, please? If you could also fix it, that would be great, but even just the failing test would be very helpful.

eugbaranov · 2018-04-24T06:01:20Z

I hadn't realised that you don't need to escape a forward slash when it is inside a bracketed character class. I can't find a definitive specification on the matter. I hoped to get it from www.ecma-international.org, but they might have forgotten to pay for the hosting...

glchapman · 2018-04-24T17:08:30Z

I'll try to send a PR with a failing case in a bit, but FWIW my reading of the current HTML standard is that you don't actually have to parse the script text (to handle regexes, etc.). Instead, raw text nodes (the content of scripts and styles) can contain any text at all up to the terminating tag:
https://www.w3.org/TR/2017/REC-html52-20171214/syntax.html#restrictions-on-the-contents-of-raw-text-and-escapable-raw-text-elements
See also the description of parsing the RAWTEXT state:
https://www.w3.org/TR/2017/REC-html52-20171214/syntax.html#rawtext-state

I don't know if this differs from earlier HTML variants.
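
As a rough illustration of that reading, the sketch below finds the end of a script by looking only for the end tag and never inspecting the JavaScript. It is a simplification under stated assumptions: the helper name `findScriptEnd` is hypothetical, the rule used is "`</script` followed by whitespace, '/' or '>'", and the extra escaped states that the standard defines for `<!--` inside script data are ignored.

```javascript
// Hypothetical helper (not part of FSharp.Data): return the index of the
// '<' that starts the first script end tag at or after `from`, or -1.
function findScriptEnd(html, from) {
  // "</script" must be followed by whitespace, '/' or '>' to count as a tag,
  // so "</scripted>" does not match. Matching is case-insensitive.
  const endTag = /<\/script[\t\n\f\r \/>]/gi;
  endTag.lastIndex = from; // start searching after the opening tag
  const m = endTag.exec(html);
  return m === null ? -1 : m.index;
}

const sample =
  "<script>return s.replace(/([/.])/g, 'x');</script><div>after</div>";
const bodyStart = "<script>".length;
console.log(sample.slice(bodyStart, findScriptEnd(sample, bodyStart)));
// return s.replace(/([/.])/g, 'x');
```

With this rule the regex in the script body never needs to be tokenized, so the '/' inside the character class cannot confuse the scan.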

colinbull · 2018-04-29T18:50:08Z

I'm just looking at this now. @glchapman Can you let me push to your PR with the failing test?

glchapman · 2018-04-29T19:08:59Z

@colinbull -- I've added you as a collaborator to my fork with the PR. Is this what you need? (I'm sorry, I'm not a big GitHub user.)

colinbull · 2018-04-29T19:53:48Z

Yep, thanks. Ignore my earlier comments; I have caught up on what's going on now.

colinbull · 2018-04-29T20:22:53Z

So, after browsing through https://tc39.github.io/ecma262/#sec-regexp-regular-expression-objects, I'm not sure that the above is a valid regex. If the intermediate '/' is escaped, then the parser works.

For my own understanding: was the original motivation for these changes that the closing script tag ended up being skipped in some cases?

eugbaranov · 2018-04-30T05:56:10Z

The original motivation was to handle some corner cases, e.g. so that </ within a regex doesn't get parsed as the start of an end tag.

But that's a good point: do we need to parse scripts at all? If I read it correctly, the HTML spec simply looks for a closing script tag.
One good thing about the current implementation is that it's resilient to a missing closing tag.
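
To make the trade-off concrete, here is a small sketch of the "just look for the closing tag" approach with a fallback for a missing end tag. It is illustrative JavaScript and not the behaviour of the current F# parser. The helper name `scriptText` is hypothetical, and taking the rest of the input when no end tag exists is an assumption made for this example.

```javascript
// Hypothetical helper (not part of FSharp.Data): take script text up to the
// first end tag, or up to the end of input when there is no end tag at all.
function scriptText(html, from) {
  const rest = html.slice(from);
  const m = /<\/script[\t\n\f\r \/>]/i.exec(rest);
  return m === null
    ? { text: rest, closed: false }                   // no end tag: take everything
    : { text: rest.slice(0, m.index), closed: true }; // stop at the end tag
}

// "</" inside a regex is harmless because only "</script" is significant.
console.log(scriptText("var r =/</g;</script>", 0));
// { text: 'var r =/</g;', closed: true }

// "</script>" inside a string literal ends the script early under this rule.
console.log(scriptText('var r ="</script>";</script>', 0));
// { text: 'var r ="', closed: true }

// Missing end tag: the remaining input becomes the script text.
console.log(scriptText("var r =/</g", 0));
// { text: 'var r =/</g', closed: false }
```

The second case is where the two approaches differ: the tag-only scan ends the script inside the string, whereas the test cases added in this PR expect the string's contents to be kept as script text.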

colinbull · 2018-04-30T20:03:02Z

OK, I have the unit tests passing now, but I have broken the signature tests (ebay_cars.htm). I haven't got any more time tonight to look, but I hope to get back to this tomorrow or Wednesday.


eugbaranov added 3 commits October 19, 2017 21:08

Add tests for #1091

d651c4f

Add another test for #1091

4c0c2e3

Fix #1091

c7ed72a

ovatsus reviewed Oct 21, 2017


eugbaranov added 4 commits October 23, 2017 20:07

Corrected typo for #1091

4e0161b

Defensive checks for end of file (#1091)

238b5b5

Add support for escape character in script regular expressions (#1091, #1093)

3466add

Additional parsing logic and tests for #1091 (#1093)

8bbe8e0

ovatsus merged commit a645e07 into fsprojects:master Dec 3, 2017

glchapman mentioned this pull request Apr 24, 2018

Test html parsing of script with '/' embedded in bracketed character class #1160

Merged

colinbull added a commit to colinbull/FSharp.Data that referenced this pull request Sep 9, 2018

Potential further fix for fsprojects#1091, and tidies regression in fsprojects#1093, maybe

f72ecf0

ovatsus mentioned this pull request Sep 10, 2018

Glchapman master #1204

Merged
