Step 1b: HTML tokenizer — script/raw text states and character references #55

@thomasnemer

Parent: #20

Goal

Add the remaining WHATWG tokenizer states: RcData, RawText, ScriptData (with all escape sub-states), PlainText, and character reference resolution (named, decimal, hex). This completes the full ~80-state tokenizer.

Prerequisites

File Changes

  • crates/ie-html/src/tokenizer.rs — add remaining states
  • crates/ie-html/src/entities.rs — new file, entity lookup
  • crates/ie-html/build.rs — new file, codegen for entity table
  • crates/ie-html/data/entities.json — vendored WHATWG named character references
  • crates/ie-html/Cargo.toml — add phf, serde_json build-dep

Implementation

Additional tokenizer states

  • RcData, RcDataLessThanSign, RcDataEndTagOpen, RcDataEndTagName
  • RawText, RawTextLessThanSign, RawTextEndTagOpen, RawTextEndTagName
  • ScriptData + all escape states (~15 states)
  • PlainText
  • CharacterReference, NumericCharacterReference, NamedCharacterReference, etc.
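The RawText family of states differs from Data mainly in how `<` is handled: only an "appropriate end tag" (one whose name matches the last start tag) closes the element. A minimal sketch of that rule, assuming a hypothetical `Tok` type and a standalone `tokenize_rawtext` function rather than the crate's real state machine:

```rust
/// Hypothetical token type for this sketch (not the crate's real `Token`).
#[derive(Debug, PartialEq)]
enum Tok {
    Text(String),
    EndTag(String),
}

/// Tokenize `input` as RAWTEXT content of the element named `last_start_tag`.
/// `</name` closes the element only if `name` matches the last start tag
/// (ASCII case-insensitively) and is followed by whitespace, `/`, or `>`.
/// Simplifications: attributes on the end tag are skipped, a missing `>`
/// still yields the end tag, and input after the end tag is ignored
/// (a real tokenizer would return to the Data state there).
fn tokenize_rawtext(input: &str, last_start_tag: &str) -> Vec<Tok> {
    let mut out = Vec::new();
    let mut text = String::new();
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after) = rest.strip_prefix("</") {
            let name_len = after
                .bytes()
                .take_while(|b| b.is_ascii_alphabetic())
                .count();
            let name = &after[..name_len];
            let next = after[name_len..].chars().next();
            let terminated =
                matches!(next, Some('>' | '/' | ' ' | '\t' | '\n' | '\x0C'));
            if terminated && name.eq_ignore_ascii_case(last_start_tag) {
                if !text.is_empty() {
                    out.push(Tok::Text(std::mem::take(&mut text)));
                }
                out.push(Tok::EndTag(name.to_ascii_lowercase()));
                break;
            }
        }
        // Anything else, including `<` and non-matching `</x>`, is plain text.
        if let Some(ch) = rest.chars().next() {
            text.push(ch);
            rest = &rest[ch.len_utf8()..];
        }
    }
    if !text.is_empty() {
        out.push(Tok::Text(text));
    }
    out
}
```

The RcData and ScriptData end-tag states follow the same "appropriate end tag" check; RcData additionally resolves character references, and ScriptData adds the escape sub-states.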

Character reference resolution

  • Vendor entities.json from WHATWG spec
  • build.rs: generate phf::Map from entities.json
  • Named entities: longest match, attribute value special rules
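  • Numeric: decimal (&#N;) and hex (&#xN;); code points 0x80–0x9F go through the spec's Windows-1252 replacement table, and null, surrogate, or out-of-range values become U+FFFD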
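The numeric rules in the list above can be sketched as follows. The function names `resolve_numeric` and `parse_numeric_ref` are hypothetical, and the parser is deliberately stricter than the spec (it requires the trailing `;`, whereas the real tokenizer accepts a missing semicolon with a parse error):

```rust
/// Windows-1252 replacements for code points 0x80..=0x9F, per the WHATWG
/// "numeric character reference end state". Slots with no replacement
/// (0x81, 0x8D, 0x8F, 0x90, 0x9D) keep their original code point.
const WIN1252: [char; 32] = [
    '\u{20AC}', '\u{81}', '\u{201A}', '\u{0192}', '\u{201E}', '\u{2026}', '\u{2020}', '\u{2021}',
    '\u{02C6}', '\u{2030}', '\u{0160}', '\u{2039}', '\u{0152}', '\u{8D}', '\u{017D}', '\u{8F}',
    '\u{90}', '\u{2018}', '\u{2019}', '\u{201C}', '\u{201D}', '\u{2022}', '\u{2013}', '\u{2014}',
    '\u{02DC}', '\u{2122}', '\u{0161}', '\u{203A}', '\u{0153}', '\u{9D}', '\u{017E}', '\u{0178}',
];

/// Map an accumulated character reference code to the character to emit.
fn resolve_numeric(code: u32) -> char {
    match code {
        0 => '\u{FFFD}',
        0x80..=0x9F => WIN1252[(code - 0x80) as usize],
        // `char::from_u32` rejects surrogates and values above 0x10FFFF,
        // which the spec also replaces with U+FFFD.
        _ => char::from_u32(code).unwrap_or('\u{FFFD}'),
    }
}

/// Parse a complete reference such as "&#60;" or "&#x3C;".
/// Returns None if the text is not a well-formed numeric reference.
fn parse_numeric_ref(s: &str) -> Option<char> {
    let body = s.strip_prefix("&#")?.strip_suffix(';')?;
    let (digits, radix): (&str, u32) =
        match body.strip_prefix(|c: char| c == 'x' || c == 'X') {
            Some(hex) => (hex, 16),
            None => (body, 10),
        };
    if digits.is_empty() {
        return None;
    }
    let mut code: u32 = 0;
    for c in digits.chars() {
        let d = c.to_digit(radix)?;
        // Saturate instead of overflowing; the result is out of range
        // anyway and ends up as U+FFFD.
        code = code.saturating_mul(radix).saturating_add(d);
    }
    Some(resolve_numeric(code))
}
```

Keeping `resolve_numeric` separate from digit accumulation mirrors how the spec splits the decimal/hex states from the end state, so the same function can be called from either path.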

Tests

  • Script content: <script>var x = 1 < 2;</script> → the < in 1 < 2 is emitted as character data, not treated as a tag open
  • Named entity: &amp; → Character('&')
  • Numeric entity: &#60; → Character('<')
  • Hex entity: &#x3C; → Character('<')
  • Entity in attribute: <a href="?a=1&amp;b=2"> → attribute value ?a=1&b=2
  • RCDATA: <title>&amp; stuff</title> → text "& stuff" (entity resolved)
  • RawText: <style>.a { }</style> → raw characters
  • State switching via set_state()
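The named-entity and attribute cases in the list above hinge on longest-match lookup plus the attribute-value exception. A minimal sketch of both, assuming a tiny hand-written `ENTITIES` table in place of the generated phf map and a hypothetical `match_named` function (the real entities.json keys also include the leading `&`, which is omitted here):

```rust
/// Tiny stand-in for the generated entity table (hypothetical; the real
/// table would be produced by build.rs from entities.json).
const ENTITIES: &[(&str, &str)] = &[
    ("amp", "&"),
    ("amp;", "&"),
    ("lt", "<"),
    ("lt;", "<"),
    ("not", "\u{AC}"),
    ("not;", "\u{AC}"),
    ("notin;", "\u{2209}"),
];

/// `input` starts just after the `&`. Returns the decoded text and the
/// number of bytes consumed, or None if the `&` should stay literal.
fn match_named(input: &str, in_attribute: bool) -> Option<(&'static str, usize)> {
    // Longest match: keep the longest table entry that prefixes the input.
    let mut best: Option<(&'static str, &'static str)> = None;
    for &(name, value) in ENTITIES {
        if input.starts_with(name) && best.map_or(true, |(b, _)| name.len() > b.len()) {
            best = Some((name, value));
        }
    }
    let (name, value) = best?;

    // Attribute-value exception: a match without a trailing `;` that is
    // followed by `=` or an ASCII alphanumeric is left unresolved.
    if in_attribute && !name.ends_with(';') {
        let next = input[name.len()..].chars().next();
        if matches!(next, Some(c) if c == '=' || c.is_ascii_alphanumeric()) {
            return None;
        }
    }
    Some((value, name.len()))
}
```

For example, `notit;` in text content resolves the `not` prefix, while the same input inside an attribute value is left as written. A linear scan is used only to keep the sketch short; a generated map would need a different lookup strategy for prefixes.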

Acceptance Criteria

  • All core tests from 1a still pass
  • Script/raw text/entity tests pass
  • Clippy clean
