Initial, BeforeHtml, BeforeHead, InHead, InHeadNoscript,
AfterHead, InBody, Text, InTable, InTableText,
InCaption, InColumnGroup, InTableBody, InRow, InCell,
InSelect, InSelectInTable, InTemplate,
AfterBody, InFrameset, AfterFrameset,
AfterAfterBody, AfterAfterFrameset
Parent: #19
Goal
Implement the WHATWG tree construction algorithm that consumes tokens from the tokenizer (Step 1) and builds a DOM tree (ie-dom). The tree builder is a state machine with 23 insertion modes that handles the complex rules of HTML nesting, error recovery, and implicit element creation.
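A minimal sketch of how the state machine could be driven, assuming the `TreeBuilderResult` variants named later in this issue. The reduced `InsertionMode` and `Token` enums, the `handled` field, and the `run` helper are hypothetical stand-ins, and the interpretation of `Reprocess` (the handler changes mode itself, then hands the token back) is an assumption rather than a settled design.

```rust
// Reduced stand-ins: the real enum has every insertion mode and Token comes from the tokenizer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InsertionMode { Initial, BeforeHtml, InBody }

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Token { Doctype, StartTag(String), Character(char), Eof }

pub enum TreeBuilderResult {
    Continue,                  // token consumed, stay in the current mode
    Reprocess(Token),          // feed the same token to the (possibly new) current mode
    SwitchMode(InsertionMode), // token consumed, change mode
}

pub struct TreeBuilder {
    pub mode: InsertionMode,
    pub handled: Vec<Token>, // hypothetical: tokens that reached InBody
}

pub fn process_token(builder: &mut TreeBuilder, token: Token) -> TreeBuilderResult {
    match (builder.mode, token) {
        (InsertionMode::Initial, Token::Doctype) => {
            TreeBuilderResult::SwitchMode(InsertionMode::BeforeHtml)
        }
        (InsertionMode::Initial, other) => {
            // Missing doctype: move on and let the next mode see the token.
            builder.mode = InsertionMode::BeforeHtml;
            TreeBuilderResult::Reprocess(other)
        }
        (InsertionMode::BeforeHtml, Token::StartTag(name)) if name == "html" => {
            // Head-related modes are skipped in this sketch.
            TreeBuilderResult::SwitchMode(InsertionMode::InBody)
        }
        (InsertionMode::BeforeHtml, other) => {
            // An <html> element would be created implicitly here.
            builder.mode = InsertionMode::InBody;
            TreeBuilderResult::Reprocess(other)
        }
        (InsertionMode::InBody, token) => {
            builder.handled.push(token);
            TreeBuilderResult::Continue
        }
    }
}

/// Feeds each token to the current mode until some mode consumes it.
pub fn run(tokens: Vec<Token>) -> TreeBuilder {
    let mut builder = TreeBuilder { mode: InsertionMode::Initial, handled: Vec::new() };
    for token in tokens {
        let mut current = token;
        loop {
            match process_token(&mut builder, current) {
                TreeBuilderResult::Continue => break,
                TreeBuilderResult::SwitchMode(mode) => {
                    builder.mode = mode;
                    break;
                }
                TreeBuilderResult::Reprocess(token) => current = token,
            }
        }
    }
    builder
}
```

Returning `Reprocess` instead of recursing keeps the call stack flat when a token falls through several modes, as a bare character does here (Initial, then BeforeHtml, then InBody).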
Prerequisites
File Changes
- crates/ie-html/src/tree_builder.rs: complete rewrite
- crates/ie-html/src/insertion_mode.rs: new file, insertion mode implementations
- crates/ie-html/src/formatting.rs: new file, active formatting elements + adoption agency
- crates/ie-html/src/lib.rs: update parse() signature
- crates/ie-html/tests/html5lib-tree/: vendored test fixtures
- crates/ie-html/tests/tree_conformance.rs: test harness
Implementation
Insertion modes (insertion_mode.rs)
- InsertionMode enum:
- fn process_token(builder: &mut TreeBuilder, token: Token) -> TreeBuilderResult
- TreeBuilderResult: Continue, Reprocess(Token), SwitchMode(InsertionMode)
TreeBuilder struct (tree_builder.rs)
- TreeBuilder state:
- TreeBuilder::parse(html: &str) -> ParseResult:
Core tree builder operations
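Of the operations in this section, generate_implied_end_tags is compact enough to sketch. The version below assumes the stack of open elements is a plain `Vec<String>` of tag names (the real builder would presumably hold NodeIds) and takes the stack as an explicit parameter rather than as a method on TreeBuilder. The tag list is the one the WHATWG spec gives for implied end tags.

```rust
/// Elements whose end tags the WHATWG spec allows to be implied.
const IMPLIED_END_TAGS: [&str; 10] = [
    "dd", "dt", "li", "optgroup", "option", "p", "rb", "rp", "rt", "rtc",
];

/// Pops elements off the stack while the current node is in the implied
/// end tag list, stopping early at `except` if one is given.
pub fn generate_implied_end_tags(open_elements: &mut Vec<String>, except: Option<&str>) {
    while let Some(current) = open_elements.last() {
        let name = current.as_str();
        let implied = IMPLIED_END_TAGS.iter().any(|tag| *tag == name);
        if !implied || Some(name) == except {
            break;
        }
        open_elements.pop();
    }
}
```

For example, with `html, body, ul, li, p` on the stack, calling it with `None` pops both `p` and `li`, while `Some("li")` pops only `p`.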
- insert_element(tag: &StartTag) -> NodeId:
- insert_character(c: char):
- insert_comment(data: &str):
- generate_implied_end_tags(except: Option<&str>):
- close_element(tag_name: &str):
- reconstruct_active_formatting():
Active formatting elements and adoption agency (formatting.rs)
- FormattingEntry enum: Element(NodeId, StartTag) or Marker
- Misnested formatting, e.g. <b>text<i>more</b>still italic</i>
- Markers at <td>, <th>, <caption>, <template>
Foster parenting
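The foster-parenting rule in this section can be sketched as a function that chooses an insertion location. This is a simplified illustration under stated assumptions: `OpenElement`, `InsertAt`, and `insertion_point` are hypothetical names, NodeId is modelled as `usize`, and the spec's `<template>` handling and its check that the target is a table-related element are left out.

```rust
pub type NodeId = usize;

#[derive(Debug, PartialEq, Eq)]
pub enum InsertAt {
    /// Append as the last child of this node.
    AppendTo(NodeId),
    /// Insert into `parent`, immediately before `before`.
    Before { parent: NodeId, before: NodeId },
}

/// Hypothetical entry in the stack of open elements.
pub struct OpenElement {
    pub id: NodeId,
    pub name: &'static str,
    pub parent: Option<NodeId>,
}

/// Returns where the next node should go, or None if the stack is empty.
pub fn insertion_point(stack: &[OpenElement], foster_parenting: bool) -> Option<InsertAt> {
    let current = stack.last()?;
    if !foster_parenting {
        return Some(InsertAt::AppendTo(current.id));
    }
    // Find the last <table> in the stack of open elements.
    let Some(idx) = stack.iter().rposition(|e| e.name == "table") else {
        // No table open (fragment case): use the first element in the stack.
        return Some(InsertAt::AppendTo(stack[0].id));
    };
    let table = &stack[idx];
    match table.parent {
        // Usual case: insert into the table's parent, right before the table.
        Some(parent) => Some(InsertAt::Before { parent, before: table.id }),
        // Detached table: fall back to the element just above it in the stack.
        None if idx > 0 => Some(InsertAt::AppendTo(stack[idx - 1].id)),
        None => Some(InsertAt::AppendTo(current.id)),
    }
}
```

Returning a location instead of mutating the DOM directly would let insert_element and insert_character share the same lookup.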
- When foster_parenting is true, insertion goes before the last table element in the stack rather than into the current node
Insertion mode implementations (selected critical modes)
- <html> element
- <head> element
- <title>, <style>, <script>, <link>, <meta>, <base>
- <style> text content → store in style_elements
- <link rel="stylesheet" href="..."> → store in link_stylesheets
- <script> → if script_runner is set, pause and call it with script content
- </p> when no p in scope → insert empty <p> then close it (spec quirk)
- <style>, <title>)
- </html> allowed
Tokenizer interaction
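The tag-to-state switches in this section could be expressed as a single lookup that the tree builder consults after inserting a start tag. The sketch assumes a `TokenizerState` enum with the variant names used here and a tokenizer exposing `set_state()`; `tokenizer_state_for` and `switch_tokenizer_for` are hypothetical helpers. It follows the WHATWG spec in treating `<title>` and `<textarea>` as RCDATA and `<style>` as raw text.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenizerState { Data, RcData, RawText, ScriptData, PlainText }

/// Minimal stand-in for the real tokenizer; only the state matters here.
pub struct Tokenizer { pub state: TokenizerState }

impl Tokenizer {
    pub fn set_state(&mut self, state: TokenizerState) {
        self.state = state;
    }
}

/// Maps a start tag name to the tokenizer state its content requires.
pub fn tokenizer_state_for(tag_name: &str) -> TokenizerState {
    match tag_name {
        "script" => TokenizerState::ScriptData,
        "title" | "textarea" => TokenizerState::RcData,
        "style" | "xmp" | "iframe" | "noembed" | "noframes" => TokenizerState::RawText,
        "plaintext" => TokenizerState::PlainText,
        _ => TokenizerState::Data,
    }
}

/// Called by the tree builder after inserting a start tag.
pub fn switch_tokenizer_for(tokenizer: &mut Tokenizer, tag_name: &str) {
    let state = tokenizer_state_for(tag_name);
    if state != TokenizerState::Data {
        tokenizer.set_state(state);
    }
}
```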
- <script> → ScriptData
- <textarea>, <title> → RcData
- <style>, <xmp>, <iframe>, <noembed>, <noframes> → RawText
- <plaintext> → PlainText
- set_state() callback to tokenizer
Tests
html5lib conformance suite
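A harness for the suite first has to split each fixture file into test cases. The html5lib tree-construction `.dat` format marks sections with `#data`, `#errors`, and `#document` header lines; the sketch below keeps only the input and the expected tree. `TreeTest` and `parse_dat` are hypothetical names, and edge cases such as input that begins with a blank line are not handled.

```rust
#[derive(Debug, Default, PartialEq, Eq)]
pub struct TreeTest {
    pub data: String,     // HTML input from the #data section
    pub document: String, // expected tree dump from the #document section
}

#[derive(Clone, Copy)]
enum Section { Data, Document, Other }

/// Stores a finished test case, trimming the blank separator line.
fn finish(tests: &mut Vec<TreeTest>, current: Option<TreeTest>) {
    if let Some(mut test) = current {
        let trimmed_len = test.document.trim_end_matches('\n').len();
        test.document.truncate(trimmed_len);
        tests.push(test);
    }
}

/// Splits the contents of one .dat file into test cases.
pub fn parse_dat(input: &str) -> Vec<TreeTest> {
    let mut tests = Vec::new();
    let mut current: Option<TreeTest> = None;
    let mut section = Section::Other;
    for line in input.lines() {
        match line {
            "#data" => {
                finish(&mut tests, current.take());
                current = Some(TreeTest::default());
                section = Section::Data;
            }
            "#document" => section = Section::Document,
            "#errors" | "#new-errors" | "#document-fragment" | "#script-on"
            | "#script-off" => section = Section::Other,
            _ => {
                if let Some(test) = current.as_mut() {
                    let target = match section {
                        Section::Data => &mut test.data,
                        Section::Document => &mut test.document,
                        Section::Other => continue, // ignored in this sketch
                    };
                    if !target.is_empty() {
                        target.push('\n');
                    }
                    target.push_str(line);
                }
            }
        }
    }
    finish(&mut tests, current);
    tests
}
```

Each `TreeTest` would then be fed to TreeBuilder::parse() and the resulting DOM serialized in the same `| `-prefixed form for comparison against `document`.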
- Fixtures: crates/ie-html/tests/html5lib-tree/
- Harness (tests/tree_conformance.rs): run TreeBuilder::parse() on each input
Unit tests
- <!DOCTYPE html><html><head></head><body></body></html> → correct structure
- <p>text → html, head, body, p all created
- <br>, <img src="x"> → no children, no close tag needed
- <b><i>text</i></b> → correct nesting
- <b>bold<i>both</b>italic</i> → adoption agency creates correct tree
- <table><tr><td>cell</td></tr></table> → tbody implicitly created
- <table>text<tr><td>cell</td></tr></table> → text moved before table
- <style> extraction: style element contents returned in ParseResult.style_elements
- <link> extraction: stylesheet hrefs returned in ParseResult.link_stylesheets
- <script> callback: script runner called with correct content
- <svg/> handled correctly
- <template><p>inside</p></template> → p is in template content, not document
Acceptance Criteria
- cargo test -p ie-html: all tests pass
- ParseResult provides extracted style and link stylesheet data
- <script> elements
- cargo clippy -p ie-html -- -D warnings: no warnings