Skip to content

Step 2: HTML tree builder — WHATWG tree construction #21

@thomasnemer

Description

@thomasnemer

Parent: #19

Goal

Implement the WHATWG tree construction algorithm that consumes tokens from the tokenizer (Step 1) and builds a DOM tree (ie-dom). The tree builder is a state machine with ~20 insertion modes that handles the complex rules of HTML nesting, error recovery, and implicit element creation.

Prerequisites

File Changes

  • crates/ie-html/src/tree_builder.rs — complete rewrite
  • crates/ie-html/src/insertion_mode.rs — new file, insertion mode implementations
  • crates/ie-html/src/formatting.rs — new file, active formatting elements + adoption agency
  • crates/ie-html/src/lib.rs — update parse() signature
  • crates/ie-html/tests/html5lib-tree/ — vendored test fixtures
  • crates/ie-html/tests/tree_conformance.rs — test harness

Implementation

Insertion modes (insertion_mode.rs)

  • InsertionMode enum:
    Initial, BeforeHtml, BeforeHead, InHead, InHeadNoscript,
    AfterHead, InBody, Text, InTable, InTableText,
    InCaption, InColumnGroup, InTableBody, InRow, InCell,
    InSelect, InSelectInTable, InTemplate,
    AfterBody, InFrameset, AfterFrameset,
    AfterAfterBody, AfterAfterFrameset
    
  • Each mode is a function: fn process_token(builder: &mut TreeBuilder, token: Token) -> TreeBuilderResult
  • TreeBuilderResult: Continue, Reprocess(Token), SwitchMode(InsertionMode)

TreeBuilder struct (tree_builder.rs)

  • TreeBuilder state:

    pub struct TreeBuilder<'a> {
        doc: Document,
        tokenizer: Tokenizer<'a>,
        insertion_mode: InsertionMode,
        original_insertion_mode: Option<InsertionMode>,  // for "text" mode return
        open_elements: Vec<NodeId>,  // stack of open elements
        active_formatting: Vec<FormattingEntry>,  // active formatting elements
        head_pointer: Option<NodeId>,
        form_pointer: Option<NodeId>,
        template_modes: Vec<InsertionMode>,  // template insertion mode stack
        frameset_ok: bool,
        scripting: bool,  // whether scripting is enabled
        foster_parenting: bool,  // redirect inserts for misgrouped table content
        script_runner: Option<Box<dyn FnMut(&str)>>,  // callback for <script> execution
        errors: Vec<ParseError>,
    }
  • TreeBuilder::parse(html: &str) -> ParseResult:

    pub struct ParseResult {
        pub document: Document,
        pub errors: Vec<ParseError>,
        pub style_elements: Vec<String>,  // contents of <style> elements
        pub link_stylesheets: Vec<String>,  // hrefs of <link rel="stylesheet">
    }

Core tree builder operations

  • insert_element(tag: &StartTag) -> NodeId:

    • Create element in DOM
    • Set attributes
    • Determine correct insertion location (accounting for foster parenting)
    • Append to parent
    • Push onto open_elements stack
  • insert_character(c: char):

    • If last child of current node is a Text node, append to it
    • Otherwise, create new Text node and insert
  • insert_comment(data: &str):

    • Create Comment node, insert at current location
  • generate_implied_end_tags(except: Option<&str>):

    • Pop from open_elements while current node is dd, dt, li, optgroup, option, p, rb, rp, rt, rtc (except the named one)
  • close_element(tag_name: &str):

    • Pop from open_elements until element with matching tag is found
  • reconstruct_active_formatting():

    • Re-open formatting elements that were implicitly closed

Active formatting elements and adoption agency (formatting.rs)

  • FormattingEntry enum: Element(NodeId, StartTag) or Marker
  • Adoption agency algorithm (the most complex part of the tree builder):
    • Handles misnested formatting: <b>text<i>more</b>still italic</i>
    • Walks the open elements and active formatting list
    • Creates new elements, reparents existing nodes
    • Has a loop limit (8 outer, 3 inner per spec) to prevent infinite loops
  • Push/pop markers for <td>, <th>, <caption>, <template>

Foster parenting

  • When foster_parenting is true, insertion goes before the last table element in the stack rather than into the current node
  • Triggered when character/element tokens arrive while in table context but not at a valid table location

Insertion mode implementations (selected critical modes)

  • Initial: handle doctype, set document mode (no-quirks always)
  • BeforeHtml: insert <html> element
  • BeforeHead: insert <head> element
  • InHead: handle <title>, <style>, <script>, <link>, <meta>, <base>
    • Extract <style> text content → store in style_elements
    • Extract <link rel="stylesheet" href="..."> → store in link_stylesheets
    • <script> → if script_runner is set, pause and call it with script content
  • InBody: the main mode — handles all body content
    • Formatting elements (b, i, em, strong, a, etc.) → push to active formatting
    • Block elements (div, p, h1-h6, etc.) → close implicit paragraphs, insert
    • Void elements (br, hr, img, input, etc.) → insert and immediately pop
    • </p> when no p in scope → insert empty <p> then close it (spec quirk)
    • Adoption agency for end tags of formatting elements
  • Text: raw text / RCDATA content (entered from InHead for <style>, <title>)
  • InTable / InTableBody / InRow / InCell: table structure enforcement
  • AfterBody: only whitespace and </html> allowed
  • InTemplate: template contents handling

Tokenizer interaction

  • Switch tokenizer state after certain tags:
    • <script> → ScriptData
    • <style>, <textarea>, <title> → RcData (via appropriate states)
    • <xmp>, <iframe>, <noembed>, <noframes> → RawText
    • <plaintext> → PlainText
  • Provide set_state() callback to tokenizer

Tests

html5lib conformance suite

  • Vendor html5lib-tests tree construction tests from https://github.com/nickthedick/html5lib-tests/tree/master/tree-construction
  • Place in crates/ie-html/tests/html5lib-tree/
  • Test format: each test specifies input HTML and expected DOM tree in a text format:
    | <html>
    |   <head>
    |   <body>
    |     <p>
    |       "Hello"
    
  • Conformance harness (tests/tree_conformance.rs):
    • Parse test files
    • Run TreeBuilder::parse() on each input
    • Serialize resulting DOM to the expected text format
    • Compare against expected output
    • Skip tests marked as quirks-mode-only (annotate them)
    • Target >95% pass rate

Unit tests

  • Minimal document: <!DOCTYPE html><html><head></head><body></body></html> → correct structure
  • Implicit elements: <p>text → html, head, body, p all created
  • Void elements: <br>, <img src="x"> → no children, no close tag needed
  • Nested formatting: <b><i>text</i></b> → correct nesting
  • Misnested formatting: <b>bold<i>both</b>italic</i> → adoption agency creates correct tree
  • Table structure: <table><tr><td>cell</td></tr></table> → tbody implicitly created
  • Foster parenting: <table>text<tr><td>cell</td></tr></table> → text moved before table
  • <style> extraction: style element contents returned in ParseResult.style_elements
  • <link> extraction: stylesheet hrefs returned in ParseResult.link_stylesheets
  • <script> callback: script runner called with correct content
  • Self-closing foreign elements: <svg/> handled correctly
  • Template contents: <template><p>inside</p></template> → p is in template content, not document

Acceptance Criteria

  • cargo test -p ie-html — all tests pass
  • html5lib tree construction suite: >95% pass rate
  • ParseResult provides extracted style and link stylesheet data
  • Script runner callback works for <script> elements
  • cargo clippy -p ie-html -- -D warnings — no warnings

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions