Step 1a: HTML tokenizer — core states and token types #54

@thomasnemer

Parent: #20

Goal

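Implement the core WHATWG HTML tokenizer states: Data, TagOpen, EndTagOpen, TagName, and the attribute, comment, DOCTYPE, self-closing start tag, and CDATA section states, plus end-of-file handling in each of them. The list under Implementation names 44 of the specification's 80 states, which is enough to tokenize ordinary HTML. The script data, RAWTEXT, RCDATA, PLAINTEXT, and character reference states are out of scope for this step.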

File Changes

  • crates/ie-html/src/token.rs — new file, Token enum and Attribute struct
  • crates/ie-html/src/tokenizer.rs — full rewrite with state machine
  • crates/ie-html/src/lib.rs — update module declarations

Implementation

Token types (token.rs)

  • Token enum: Doctype, StartTag, EndTag, Character, Comment, Eof
  • Attribute struct: name + value
  • Convenience methods: is_start_tag, is_end_tag
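
As a rough illustration of the token types listed above, token.rs could look something like the sketch below. The variant and method names come from this issue, and self_closing matches the test expectation further down. The remaining field names (attributes, public_id, system_id, force_quirks), the one-character-per-Character-token choice, and the tag-name argument on the convenience methods are assumptions made for the example, not a settled design.

```rust
/// One attribute on a tag (name + value, as described in this issue).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Attribute {
    pub name: String,
    pub value: String,
}

/// Tokens produced by the tokenizer. Field names other than
/// `self_closing` are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Token {
    Doctype {
        name: Option<String>,
        public_id: Option<String>,
        system_id: Option<String>,
        force_quirks: bool,
    },
    StartTag {
        name: String,
        attributes: Vec<Attribute>,
        self_closing: bool,
    },
    EndTag {
        name: String,
    },
    Character(char),
    Comment(String),
    Eof,
}

impl Token {
    /// True if this is a start tag with the given (lowercase) name.
    pub fn is_start_tag(&self, tag: &str) -> bool {
        matches!(self, Token::StartTag { name, .. } if name == tag)
    }

    /// True if this is an end tag with the given (lowercase) name.
    pub fn is_end_tag(&self, tag: &str) -> bool {
        matches!(self, Token::EndTag { name } if name == tag)
    }
}
```

Deriving PartialEq keeps the unit tests short, since a whole token sequence can be compared with a single assert_eq!.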

Tokenizer (tokenizer.rs)

  • TokenizerState enum with core states:
    Data, TagOpen, EndTagOpen, TagName,
    BeforeAttributeName, AttributeName, AfterAttributeName,
    BeforeAttributeValue, AttributeValueDoubleQuoted, AttributeValueSingleQuoted, AttributeValueUnquoted,
    AfterAttributeValueQuoted, SelfClosingStartTag,
    BogusComment, MarkupDeclarationOpen,
    CommentStart, CommentStartDash, Comment, CommentLessThanSign,
    CommentLessThanSignBang, CommentLessThanSignBangDash, CommentLessThanSignBangDashDash,
    CommentEndDash, CommentEnd, CommentEndBang,
    Doctype, BeforeDoctypeName, DoctypeName, AfterDoctypeName,
    AfterDoctypePublicKeyword, BeforeDoctypePublicIdentifier,
    DoctypePublicIdentifierDoubleQuoted, DoctypePublicIdentifierSingleQuoted,
    AfterDoctypePublicIdentifier, BetweenDoctypePublicAndSystemIdentifiers,
    AfterDoctypeSystemKeyword, BeforeDoctypeSystemIdentifier,
    DoctypeSystemIdentifierDoubleQuoted, DoctypeSystemIdentifierSingleQuoted,
    AfterDoctypeSystemIdentifier, BogusDoctype,
    CDataSection, CDataSectionBracket, CDataSectionEnd

  • Tokenizer struct with Iterator impl

  • State machine step() method

  • Pending token queue (VecDeque) for multi-token emissions

  • set_state() so the tree builder can switch the tokenizer state between tokens
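
The pieces above (state enum, step(), pending queue, Iterator impl, set_state()) could fit together roughly as in the following sketch. It is deliberately cut down to four states and a simplified Token type so that it stays self-contained. The field names, the begin_tag helper, and the fallback to Data from EndTagOpen are assumptions made for illustration; attribute, self-closing, comment, and doctype handling are omitted.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Token {
    StartTag(String),
    EndTag(String),
    Character(char),
    Eof,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenizerState {
    Data, TagOpen, EndTagOpen, TagName,
}

pub struct Tokenizer {
    input: Vec<char>,
    pos: usize,
    state: TokenizerState,
    pending: VecDeque<Token>, // tokens queued by step(), drained by next()
    tag_name: String,
    is_end_tag: bool,
    done: bool,
}

impl Tokenizer {
    pub fn new(html: &str) -> Self {
        Tokenizer {
            input: html.chars().collect(),
            pos: 0,
            state: TokenizerState::Data,
            pending: VecDeque::new(),
            tag_name: String::new(),
            is_end_tag: false,
            done: false,
        }
    }

    /// Lets the tree builder switch states between tokens.
    pub fn set_state(&mut self, state: TokenizerState) {
        self.state = state;
    }

    // Starts a new tag and reconsumes the current character in TagName.
    fn begin_tag(&mut self, is_end_tag: bool) {
        self.is_end_tag = is_end_tag;
        self.tag_name.clear();
        self.pos -= 1;
        self.state = TokenizerState::TagName;
    }

    /// Consumes one character (or EOF) and may queue zero or more tokens.
    fn step(&mut self) {
        use TokenizerState::*;
        let c = self.input.get(self.pos).copied();
        self.pos += 1;
        match (self.state, c) {
            (Data, Some('<')) => self.state = TagOpen,
            (Data, Some(ch)) => self.pending.push_back(Token::Character(ch)),
            (TagOpen, Some('/')) => self.state = EndTagOpen,
            (TagOpen, Some(ch)) if ch.is_ascii_alphabetic() => self.begin_tag(false),
            (TagOpen, Some(_)) => {
                // Not a tag after all: emit '<' as text, reconsume in Data.
                self.pending.push_back(Token::Character('<'));
                self.pos -= 1;
                self.state = Data;
            }
            (EndTagOpen, Some(ch)) if ch.is_ascii_alphabetic() => self.begin_tag(true),
            // Simplification: the spec enters BogusComment here unless c is '>'.
            (EndTagOpen, Some(_)) => self.state = Data,
            (TagName, Some('>')) => {
                let name = std::mem::take(&mut self.tag_name);
                let token = if self.is_end_tag { Token::EndTag(name) } else { Token::StartTag(name) };
                self.pending.push_back(token);
                self.state = Data;
            }
            (TagName, Some(ch)) => self.tag_name.push(ch.to_ascii_lowercase()),
            (state, None) => {
                // EOF: flush half-open tag text, drop an unfinished tag (eof-in-tag).
                let text = match state { TagOpen => "<", EndTagOpen => "</", _ => "" };
                self.pending.extend(text.chars().map(Token::Character));
                self.pending.push_back(Token::Eof);
                self.done = true;
            }
        }
    }
}

impl Iterator for Tokenizer {
    type Item = Token;
    fn next(&mut self) -> Option<Token> {
        while self.pending.is_empty() && !self.done {
            self.step();
        }
        self.pending.pop_front()
    }
}
```

The queue matters at end of input: for the input "a</" a single step() call in EndTagOpen queues three tokens ('<', '/', and Eof), and next() hands them out one at a time. With this shape, Tokenizer::new("<p>hi</p>").collect::<Vec<_>>() gives the start tag, two Character tokens, the end tag, and Eof.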

Tests (unit tests in tokenizer.rs)

  • Simple tag: <div> → StartTag
  • Self-closing: <br/> → StartTag { self_closing: true }
  • Attributes: <a href="url" class="c"> → StartTag with attributes
  • End tag: </div> → EndTag
  • Comment: <!-- hello --> → Comment
  • Doctype: <!DOCTYPE html> → Doctype
  • Character data: hello → Character tokens
  • Nested tags: <div><p>text</p></div> → correct sequence
  • Malformed: <div (EOF in tag) → eof-in-tag parse error; the WHATWG spec drops the incomplete tag and emits only Eof
  • Unquoted attributes: <div id=main> → correct attribute
  • Multiple attributes with mixed quoting

Acceptance Criteria

  • cargo test -p ie-html — all tests pass
  • cargo clippy -p ie-html -- -D warnings — no warnings
  • Tokenizer is iterator-based
  • Parse errors are logged and tokenization continues; malformed input never aborts or panics
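
One possible way to meet the "log parse errors, keep going" criterion is to record each error in a side log instead of returning a Result. The sketch below is an assumption about how that might look: ParseError, ErrorLog, and scan_text are hypothetical names, and only the three error codes shown are taken from the WHATWG specification. scan_text mimics the Data state's handling of a NUL character, which the spec reports as an error and still emits as a character.

```rust
/// A few tokenizer parse errors; the names mirror WHATWG error codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParseError {
    EofInTag,
    MissingEndTagName,
    UnexpectedNullCharacter,
}

impl ParseError {
    /// Error code as spelled in the WHATWG HTML specification.
    pub fn code(self) -> &'static str {
        match self {
            ParseError::EofInTag => "eof-in-tag",
            ParseError::MissingEndTagName => "missing-end-tag-name",
            ParseError::UnexpectedNullCharacter => "unexpected-null-character",
        }
    }
}

/// Hypothetical error sink: errors are appended, never raised.
#[derive(Debug, Default)]
pub struct ErrorLog {
    entries: Vec<(usize, ParseError)>,
}

impl ErrorLog {
    /// Records an error together with the byte offset where it occurred.
    pub fn report(&mut self, offset: usize, error: ParseError) {
        self.entries.push((offset, error));
    }

    pub fn entries(&self) -> &[(usize, ParseError)] {
        &self.entries
    }
}

/// Text-only scan in the style of the Data state: a NUL is reported
/// to the log and still passed through, so scanning never stops early.
pub fn scan_text(input: &str, log: &mut ErrorLog) -> Vec<char> {
    let mut out = Vec::new();
    for (offset, ch) in input.char_indices() {
        if ch == '\0' {
            log.report(offset, ParseError::UnexpectedNullCharacter);
        }
        out.push(ch);
    }
    out
}
```

Keeping errors out of the return type means the Iterator impl can keep yielding plain tokens, while tests can still assert on which errors were recorded.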
