Parent: #20
Goal
Implement the core WHATWG HTML tokenizer states: Data, TagOpen, EndTagOpen, TagName, the attribute, comment, and doctype states, the self-closing start tag state, and EOF handling. This covers ~40 of the ~80 states in the spec, which is enough to tokenize ordinary HTML that contains no scripts, raw text elements, or character references.
File Changes
crates/ie-html/src/token.rs: new file, Token enum and Attribute struct
crates/ie-html/src/tokenizer.rs: full rewrite with state machine
crates/ie-html/src/lib.rs: update module declarations
Implementation
Token types (token.rs)
Tokenizer (tokenizer.rs)
TokenizerState enum with core states:
Data, TagOpen, EndTagOpen, TagName,
BeforeAttributeName, AttributeName, AfterAttributeName,
BeforeAttributeValue, AttributeValueDoubleQuoted, AttributeValueSingleQuoted, AttributeValueUnquoted,
AfterAttributeValueQuoted, SelfClosingStartTag,
BogusComment, MarkupDeclarationOpen,
CommentStart, CommentStartDash, Comment, CommentLessThanSign,
CommentLessThanSignBang, CommentLessThanSignBangDash, CommentLessThanSignBangDashDash,
CommentEndDash, CommentEnd, CommentEndBang,
Doctype, BeforeDoctypeName, DoctypeName, AfterDoctypeName,
AfterDoctypePublicKeyword, BeforeDoctypePublicIdentifier,
DoctypePublicIdentifierDoubleQuoted, DoctypePublicIdentifierSingleQuoted,
AfterDoctypePublicIdentifier, BetweenDoctypePublicAndSystemIdentifiers,
AfterDoctypeSystemKeyword, BeforeDoctypeSystemIdentifier,
DoctypeSystemIdentifierDoubleQuoted, DoctypeSystemIdentifierSingleQuoted,
AfterDoctypeSystemIdentifier, BogusDoctype,
CDataSection, CDataSectionBracket, CDataSectionEnd
Tokenizer struct with Iterator impl
State machine step() method
Pending token queue (VecDeque) for multi-token emissions
set_state() for tree builder feedback
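As a rough illustration of how the pieces listed above could fit together, here is a minimal sketch. It is not the planned implementation. It assumes a cut-down `Token` enum without attributes, covers only five of the states, and the field names (`pending`, `is_end`, `done`) and the `emit_tag` helper are hypothetical. The EOF-in-tag arm follows this issue's "best-effort token" wording; the WHATWG spec instead drops the incomplete tag and emits only EOF.

```rust
use std::collections::VecDeque;

// Hypothetical token.rs shape; attributes are left out of this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    StartTag { name: String, self_closing: bool },
    EndTag { name: String },
    Character(char),
    Eof,
}

// A small subset of the states listed above.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub enum TokenizerState { #[default] Data, TagOpen, EndTagOpen, TagName, SelfClosingStartTag }

#[derive(Default)]
pub struct Tokenizer {
    input: Vec<char>,
    pos: usize,
    state: TokenizerState,
    pending: VecDeque<Token>, // holds tokens when one step emits several
    name: String,             // tag name under construction
    is_end: bool,             // true while building an end tag
    done: bool,               // set once Eof has been queued
}

impl Tokenizer {
    pub fn new(src: &str) -> Self {
        Tokenizer { input: src.chars().collect(), ..Default::default() }
    }

    /// Tree builder feedback hook: overrides the current state.
    pub fn set_state(&mut self, state: TokenizerState) {
        self.state = state;
    }

    fn emit_tag(&mut self, self_closing: bool) {
        let name = std::mem::take(&mut self.name);
        let token = if self.is_end {
            Token::EndTag { name }
        } else {
            Token::StartTag { name, self_closing }
        };
        self.pending.push_back(token);
        self.is_end = false;
        self.state = TokenizerState::Data;
    }

    /// Consumes one character (or EOF) and runs one state transition.
    fn step(&mut self) {
        use TokenizerState::*;
        let c = self.input.get(self.pos).copied();
        self.pos += 1;
        match (self.state, c) {
            (Data, Some('<')) => self.state = TagOpen,
            (Data, Some(ch)) => self.pending.push_back(Token::Character(ch)),
            (Data, None) => { self.pending.push_back(Token::Eof); self.done = true; }
            (TagOpen, Some('/')) => { self.is_end = true; self.state = EndTagOpen; }
            (TagOpen, Some(ch)) | (EndTagOpen, Some(ch)) if ch.is_ascii_alphabetic() => {
                self.pos -= 1; // reconsume the letter in TagName
                self.state = TagName;
            }
            (TagOpen, _) | (EndTagOpen, _) => {
                // Not a tag: emit the consumed "<" or "</" as text, reconsume in Data.
                // Simplification: the spec routes some of these inputs to BogusComment.
                self.pending.push_back(Token::Character('<'));
                if std::mem::take(&mut self.is_end) {
                    self.pending.push_back(Token::Character('/'));
                }
                self.pos -= 1;
                self.state = Data;
            }
            (TagName, Some('>')) => self.emit_tag(false),
            (TagName, Some('/')) => self.state = SelfClosingStartTag,
            (TagName, Some(ch)) => self.name.push(ch.to_ascii_lowercase()),
            (SelfClosingStartTag, Some('>')) => self.emit_tag(true),
            (SelfClosingStartTag, Some(_)) => { self.pos -= 1; self.state = TagName; }
            // Best-effort EOF in tag: emit what was read; Data then emits Eof.
            (TagName, None) | (SelfClosingStartTag, None) => self.emit_tag(false),
        }
    }
}

impl Iterator for Tokenizer {
    type Item = Token;
    fn next(&mut self) -> Option<Token> {
        while self.pending.is_empty() && !self.done {
            self.step();
        }
        self.pending.pop_front()
    }
}
```

With this sketch, collecting `Tokenizer::new("<div>hi</div><br/>")` into a `Vec` yields a `div` start tag, two character tokens, a `div` end tag, a self-closing `br` start tag, and `Eof`. Draining the queue in `next()` before calling `step()` again is what lets a single transition emit more than one token, as in the `</` fallback.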
Tests (unit tests in tokenizer.rs)
<div> → StartTag
<br/> → StartTag { self_closing: true }
<a href="url" class="c"> → StartTag with attributes
</div> → EndTag
<!-- hello --> → Comment
<!DOCTYPE html> → Doctype
hello → Character tokens
<div><p>text</p></div> → correct sequence
<div (EOF in tag) → best-effort token
<div id=main> → correct attribute
Acceptance Criteria
cargo test -p ie-html: all tests pass
cargo clippy -p ie-html -- -D warnings: no warnings
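The two attribute cases in the Tests list (`<a href="url" class="c">` and `<div id=main>`) exercise the attribute states. Below is a standalone sketch of those transitions, offered as an illustration only. `parse_attributes`, `AttrState`, and `flush` are hypothetical names, not part of the planned API. The input is assumed to be the text between the tag name and the closing `>`, so `/` and `>` handling is out of scope, and AfterAttributeValueQuoted is folded into BeforeName for that reason.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Attribute {
    pub name: String,
    pub value: String,
}

// Condensed stand-ins for the attribute states listed in TokenizerState.
#[derive(Clone, Copy)]
enum AttrState { BeforeName, Name, AfterName, BeforeValue, DoubleQuoted, SingleQuoted, Unquoted }

// Finishes the current attribute; a repeated name is dropped, as in the spec.
fn flush(attrs: &mut Vec<Attribute>, name: &mut String, value: &mut String) {
    let (n, v) = (std::mem::take(name), std::mem::take(value));
    if !n.is_empty() && !attrs.iter().any(|a| a.name == n) {
        attrs.push(Attribute { name: n, value: v });
    }
}

pub fn parse_attributes(src: &str) -> Vec<Attribute> {
    use AttrState::*;
    let mut attrs = Vec::new();
    let (mut name, mut value) = (String::new(), String::new());
    let mut state = BeforeName;
    for c in src.chars() {
        let ws = c.is_ascii_whitespace();
        state = match state {
            BeforeName if ws => BeforeName,
            BeforeName => { name.push(c.to_ascii_lowercase()); Name }
            Name if ws => AfterName,
            Name | AfterName if c == '=' => BeforeValue,
            Name => { name.push(c.to_ascii_lowercase()); Name }
            AfterName if ws => AfterName,
            AfterName => {
                // A new name starts: the previous attribute had no value.
                flush(&mut attrs, &mut name, &mut value);
                name.push(c.to_ascii_lowercase());
                Name
            }
            BeforeValue if ws => BeforeValue,
            BeforeValue if c == '"' => DoubleQuoted,
            BeforeValue if c == '\'' => SingleQuoted,
            BeforeValue => { value.push(c); Unquoted }
            DoubleQuoted if c == '"' => { flush(&mut attrs, &mut name, &mut value); BeforeName }
            SingleQuoted if c == '\'' => { flush(&mut attrs, &mut name, &mut value); BeforeName }
            Unquoted if ws => { flush(&mut attrs, &mut name, &mut value); BeforeName }
            DoubleQuoted | SingleQuoted | Unquoted => { value.push(c); state }
        };
    }
    // End of input: keep whatever attribute was in progress.
    flush(&mut attrs, &mut name, &mut value);
    attrs
}
```

In the real tokenizer these transitions would live inside `step()` and act on the tag token being built; the free function here only keeps the example short enough to check by hand.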