Parent: #20
Goal
Add the remaining WHATWG tokenizer states: RcData, RawText, ScriptData (with all escape sub-states), PlainText, and character reference resolution (named, decimal, hex). This completes the full ~80-state tokenizer.
Prerequisites
File Changes
crates/ie-html/src/tokenizer.rs — add remaining states
crates/ie-html/src/entities.rs — new file, entity lookup
crates/ie-html/build.rs — new file, codegen for entity table
crates/ie-html/data/entities.json — vendored WHATWG named character references
crates/ie-html/Cargo.toml — add phf, serde_json build-dep
Implementation
Additional tokenizer states
- RcData, RcDataLessThanSign, RcDataEndTagOpen, RcDataEndTagName
- RawText, RawTextLessThanSign, RawTextEndTagOpen, RawTextEndTagName
- ScriptData + all escape states (~15 states)
- PlainText
- CharacterReference, NumericCharacterReference, NamedCharacterReference, etc.
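The RcData, RawText and ScriptData end-tag states all hinge on the same check: a `</name` sequence only closes the element when `name` matches the last emitted start tag and is followed by whitespace, `/` or `>`. A minimal sketch of that check as a standalone scan rather than a state machine is shown here; the function name `find_appropriate_end_tag` is hypothetical and not part of the existing tokenizer.

```rust
/// Returns the byte offset of the first "appropriate end tag" in `input`,
/// i.e. `</` + `last_start_tag` (ASCII case-insensitive) followed by
/// tab, LF, FF, space, `/` or `>`. Hypothetical helper, for illustration only.
pub fn find_appropriate_end_tag(input: &str, last_start_tag: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    let name = last_start_tag.as_bytes();
    let mut i = 0;
    // Stop once there is no room left for `</` plus the tag name.
    while i + 2 + name.len() <= bytes.len() {
        if bytes[i] == b'<' && bytes[i + 1] == b'/' {
            let end = i + 2 + name.len();
            let name_matches = bytes[i + 2..end].eq_ignore_ascii_case(name);
            // EOF directly after the name does not count as a terminator.
            let terminated = matches!(
                bytes.get(end).copied(),
                Some(b'\t' | b'\n' | 0x0C | b' ' | b'/' | b'>')
            );
            if name_matches && terminated {
                return Some(i);
            }
        }
        i += 1;
    }
    None
}
```

Everything before the returned offset would be emitted as character tokens; anything that merely looks like an end tag (`</styles>`, `</div>` inside `<style>`) stays part of the text.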
Character reference resolution
- Vendor entities.json from WHATWG spec
- build.rs: generate phf::Map from entities.json
- Named entities: longest match, attribute value special rules
- Numeric: decimal (&#N;) and hex (&#xN;), Windows-1252 replacements
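The resolution rules above can be sketched roughly as follows. This is an illustration under assumptions, not the planned implementation: the small `ENTITIES` slice stands in for the `phf::Map` that build.rs would generate, and the function names (`resolve_numeric`, `parse_numeric_reference`, `match_named_reference`) are hypothetical. Inputs are the text after the `&`; numeric bodies exclude the trailing `;`.

```rust
/// Windows-1252 remapping applied to numeric references in 0x80..=0x9F.
fn windows_1252_replacement(code: u32) -> Option<u32> {
    Some(match code {
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
        _ => return None,
    })
}

/// Maps a parsed code point to a char; NUL, surrogates and out-of-range
/// values become U+FFFD.
pub fn resolve_numeric(code: u32) -> char {
    let code = windows_1252_replacement(code).unwrap_or(code);
    if code == 0 {
        return '\u{FFFD}';
    }
    char::from_u32(code).unwrap_or('\u{FFFD}')
}

/// Parses bodies such as "#60" or "#x3C". Returns None if there are no digits.
pub fn parse_numeric_reference(body: &str) -> Option<char> {
    let rest = body.strip_prefix('#')?;
    let hex = rest.strip_prefix('x').or_else(|| rest.strip_prefix('X'));
    let (radix, digits) = match hex {
        Some(h) => (16, h),
        None => (10, rest),
    };
    if digits.is_empty() {
        return None;
    }
    let mut code: u32 = 0;
    for ch in digits.chars() {
        // Saturate so absurdly long digit runs end up as U+FFFD, not a panic.
        code = code.saturating_mul(radix).saturating_add(ch.to_digit(radix)?);
    }
    Some(resolve_numeric(code))
}

/// Stand-in for the generated entity table (assumption: keys without `&`).
const ENTITIES: &[(&str, &str)] = &[
    ("amp", "&"), ("amp;", "&"), ("lt", "<"), ("lt;", "<"),
    ("not", "\u{AC}"), ("not;", "\u{AC}"), ("notin;", "\u{2209}"),
];

/// Longest-prefix match; returns (replacement, bytes consumed after `&`).
pub fn match_named_reference(input: &str, in_attribute: bool) -> Option<(&'static str, usize)> {
    let mut best: Option<(&'static str, &'static str)> = None;
    for &(name, value) in ENTITIES {
        if input.starts_with(name) && best.map_or(true, |(b, _)| name.len() > b.len()) {
            best = Some((name, value));
        }
    }
    let (name, value) = best?;
    // Attribute rule: a match without `;` followed by `=` or an ASCII
    // alphanumeric is left as literal text.
    if in_attribute && !name.ends_with(';') {
        let next = input[name.len()..].chars().next();
        if matches!(next, Some(c) if c == '=' || c.is_ascii_alphanumeric()) {
            return None;
        }
    }
    Some((value, name.len()))
}
```

A linear scan is only acceptable for this toy table; with the real table of over two thousand names, the generated map would presumably be probed with progressively shorter prefixes instead.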
Tests
- Script content:
<script>var x = 1 < 2;</script> → start tag, then var x = 1 < 2; as character tokens (the < inside the script must not open a tag), then end tag
- Named entity:
&amp; → Character('&')
- Numeric entity:
&#60; → Character('<')
- Hex entity:
&#x3C; → Character('<')
- Entity in attribute:
<a href="?a=1&amp;b=2"> → attribute value ?a=1&b=2
- RCDATA:
<title>&amp; stuff</title> → entities resolved
- RawText:
<style>.a { }</style> → raw characters
- State switching via set_state()
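For the set_state() test, a tree builder (or the test itself) has to pick the content state after seeing a start tag. A minimal sketch of that mapping follows; only `set_state` is named in this issue, so the `State` variants shown, the `Tokenizer` struct shape, and `state_after_start_tag` are assumptions for illustration. `noscript` is left out because its handling depends on the scripting flag.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Data,
    RcData,
    RawText,
    ScriptData,
    PlainText,
}

/// Hypothetical minimal tokenizer shell; the real one carries input and buffers.
pub struct Tokenizer {
    state: State,
}

impl Default for Tokenizer {
    fn default() -> Self {
        Tokenizer { state: State::Data }
    }
}

impl Tokenizer {
    /// Lets the caller switch the content model after a start tag.
    pub fn set_state(&mut self, state: State) {
        self.state = state;
    }

    pub fn state(&self) -> State {
        self.state
    }
}

/// State the tokenizer should be in after emitting the given start tag.
pub fn state_after_start_tag(tag_name: &str) -> State {
    match tag_name {
        "title" | "textarea" => State::RcData,
        "style" | "xmp" | "iframe" | "noembed" | "noframes" => State::RawText,
        "script" => State::ScriptData,
        "plaintext" => State::PlainText,
        _ => State::Data,
    }
}
```

A test could then feed `<title>`, call `set_state(state_after_start_tag("title"))`, and assert that the following `&amp;` is resolved while `<b>` is emitted as characters.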
Acceptance Criteria
- All core tests from 1a still pass
- Script/raw text/entity tests pass
- Clippy clean