New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional parsing for scripts #1093
Conversation
src/Html/HtmlParser.fs
Outdated
| _ -> state.Cons(); scriptDoubleQuoteString state | ||
and scriptSlash state = | ||
match state.Peek() with | ||
| '/' -> state.Cons(); scriptSinleLineComment state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: sinle instead of single
@@ -96,6 +96,18 @@ let ``Can handle attributes with no value``() = | |||
] | |||
node.Attributes() |> should equal expected | |||
|
|||
[<TestCase("var r =\"</script>\"")>] | |||
[<TestCase("var r ='</script>'")>] | |||
[<TestCase("var r =/</g")>] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case with "var r =/\/</g"
[<TestCase("var r =\"</script>\"")>] | ||
[<TestCase("var r ='</script>'")>] | ||
[<TestCase("var r =/</g")>] | ||
[<TestCase("//</script>\n")>] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the opposite case, not having the \n at the end to make sure it fails
[<TestCase("var r =/</g")>] | ||
[<TestCase("//</script>\n")>] | ||
[<TestCase("/*</script>*/")>] | ||
[<TestCase("/*</script>**/")>] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add test cases that use /* */
and span multiple lines?
/cc @colinbull |
@Regent , is this finished now or are you still looking at more corner cases? @colinbull , can you have a quick look as well? |
@ovatsus, in coming days I also want to cover the case of parsing |
Sorry, I thought I had already merged this before |
Released in 2.4.3 |
Doing some scraping, I found a case where it appears this change (or at least changes since 2.4.2) do not find the end tag of a script. The url in question is : I'm attaching captures from the VS Debugger's text visualization of the script node's contents (Raw View). As you can see, in 2.4.2, the script node is correctly terminated, whereas in 2.4.6, it is not terminated until the closing tag of the ensuing script. |
I believe I've narrowed down my problem to the handling of regexes in scripts. As you can see in the attached example, the regex contains a '/' in a bracketed character class. The tokenizer mistakes this for the end delimiter of the regex, and so fails to tokenize correctly.
|
@glchapman , can you send a PR with a failing test case please? And if you could also fix it that would be great, but if you send the failing tests that's already very helpful |
I haven't realised that you don't need to escape forward slash when it is inside a group. I can't find a definitive specification on the matter - hoped to get it from www.ecma-international.org but they might have forgotten to pay for the hosting... |
I'll try to send a PR with a failing case in a bit, but FWIW my reading of the current HTML standard implies you don't actually have to parse the script text (to handle regexes, etc). Instead, raw text nodes (the content of scripts and styles) can have any text at all up to the terminating tag: I don't know if this differs from earlier HTML variants. |
I'm just looking at this now. @glchapman Can you let me push to your PR with the failing test? |
@colinbull -- I've added you as a collaborator to my fork with the PR. Is this what you need? (I'm sorry, I'm not a big github user). |
Yep thanks. Ignore my comments before I have caught up on whats going on now. |
So I'm not sure after browsing through this https://tc39.github.io/ecma262/#sec-regexp-regular-expression-objects that the above is a valid regex. If the intermediate '/' is escaped. then the parser works. For my knowledge is the original motivation for these changes because the closing |
Original motivation was to handle some corner-cases, e.g. so that </ within regex doesn't get parsed as start of an end tag. But that's a good point - do we need to parse scripts at all. If I read it correctly, HTML spec simply looks for a closing script tag. |
Ok, I have the unit tests passing now but have broken the signature tests (ebay_cars.htm). I haven't got anymore time tonight to look but hoping to get back to this, tomorrow or wednesday. |
Whilst investigating original issue I've found a few more cases when parser would mistake special characters for end tag sequence.