Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some invalid HTML documents yield Failure("require_current_element: None") #67

Open
Tchou opened this issue Nov 25, 2021 · 3 comments
Open

Comments

@Tchou
Copy link

Tchou commented Nov 25, 2021

The following document (reduced as much as possible) makes parse_html raise a failure "require_current_element: None". Note that the document is indeed invalid since xmp elements are both deprecated and should not appear inside a td element but still one should be able to perform error recovery. I'm guessing the failure exception should not leak to the client ?

<!DOCTYPE html>
<html>
<head><title></title></head>
<body>
    <table>
       <tr>
         <td>
           <xmp></xmp>
         </td>
       </tr>
    </table>

    <table >
      <tr>
        <td></td>
      </tr>
    </table>
    <code></code>
</body>

</html>

I tested it with markup 1.0.2, using the following program :

open Markup

let stream, _close = file "test.html"

let  parser =  parse_html 
        ~report:(fun _ _ -> ())
        ~context:`Document stream
    
let _ = iter ignore (signals parser)
@aantron
Copy link
Owner

aantron commented Nov 25, 2021

I haven't looked into the error in detail yet, but, indeed, exceptions should not leak to the library user, and the parser should be able to recover at least somehow from all errors.

@aantron
Copy link
Owner

aantron commented Nov 25, 2021

You may be able to follow along these rules in the HTML5 spec to get an idea of what the parser is required by the spec to do (that's what I'll do when debugging). The input is quite small (thank you!), so it should be doable. If you can spot the point at which the parser diverges, that would probably be the bug.

The error makes me suspect that this is an interaction with the "adoption agency algorithm", called on by the spec in various places. It is used to reorganize the document tree in response to various misnested elements. Markup.ml creates local subtrees to "buffer" parser stream output in case those subtrees might be affected by the "adoption agency algorithm," and streams those subtrees out on its final output stream once it is sure that the subtree is safe from modification by the algorithm.

@aantron
Copy link
Owner

aantron commented Nov 25, 2021

If you do try to follow the spec, and run into the "foster parenting algorithm," which is required in misnesting inside <table>, Markup.ml does not implement that algorithm per the conformance status. This is because the "foster parenting algorithm" can require very remote modifications to the document tree being parsed, which, in a streaming parser, means buffering a potentially huge amount of signals before emitting them — defeating the point of streaming.

As an aside, it is possible to implement the algorithm as a separate pass over the signal stream for people that need this level of conformance (at the price of buffering the document), but I don't think that's ever been done for Markup.ml.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants