Contributor

Copilot AI commented Jan 4, 2026

Implements a character-by-character HTML tokenizer that delegates <style> and <script> content to specialized tokenizers while maintaining streaming architecture.

Implementation

  • Core tokenizer: State machine-based HTML parser inheriting from BaseSubTokenizer<HtmlToken>

    • Token types mirror XML tokenizer: ElementName, AttributeName, AttributeValue, Comment, DocumentTypeDeclaration, etc.
    • Stop delimiter support for partial parsing
  • Sub-tokenizer delegation:

    • <style> content → CssTokenizer with </style> as stop delimiter
    • <script> content → TypescriptTokenizer with </script> as stop delimiter
    • Closing tags emitted after sub-tokenizer completes
  • Markdown integration: HTML code blocks now use HtmlTokenizer instead of XmlTokenizer, enabling proper tokenization of embedded CSS/JavaScript
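
The stop-delimiter idea behind the delegation can be illustrated with a small standalone sketch. It does not use the NTokenizers API: `StopDelimiterSketch` and `ReadUntil` are hypothetical names chosen for illustration, and the case-insensitive match is an assumption. It reads characters one at a time and stops once the delimiter (for example `</style>`) has been consumed, leaving the rest of the input for the outer tokenizer.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of NTokenizers: shows how content can be
// read character-by-character until a stop delimiter is reached.
public static class StopDelimiterSketch
{
    // Returns everything before the stop delimiter. The delimiter itself is
    // consumed from the reader but not included in the result.
    public static string ReadUntil(TextReader reader, string stopDelimiter)
    {
        var content = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1)
        {
            content.Append((char)next);
            if (EndsWith(content, stopDelimiter))
            {
                // Drop the delimiter from the collected content and stop.
                content.Length -= stopDelimiter.Length;
                break;
            }
        }
        return content.ToString();
    }

    // Case-insensitive suffix check (assumption: </STYLE> should also match).
    private static bool EndsWith(StringBuilder sb, string suffix)
    {
        if (sb.Length < suffix.Length) return false;
        int offset = sb.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (char.ToLowerInvariant(sb[offset + i]) != char.ToLowerInvariant(suffix[i]))
                return false;
        }
        return true;
    }
}
```

In this sketch the whole content is buffered before it is returned; a real sub-tokenizer would presumably emit tokens while scanning instead.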

Usage

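using System;
using System.IO;
using System.Text;

string html = """
<style>
    body { color: red; }
</style>
<script>
    console.log('Hello');
</script>
""";

// Wrap the HTML string in a stream so it can be passed to ParseAsync.
using var stream = new MemoryStream(Encoding.UTF8.GetBytes(html));

await HtmlTokenizer.Create().ParseAsync(stream, token => {
    // CSS and JavaScript content is tokenized by the specialized tokenizers
    Console.WriteLine($"{token.TokenType}: {token.Value}");
});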

This enables syntax highlighting for nested structures like Markdown → HTML → CSS/JavaScript.

Files Added

  • src/NTokenizers/Html/ - Token types and tokenizer implementation
  • src/NTokenizers/Markdown/Metadata/HtmlCodeBlockMetadata.cs - Markdown integration
  • tests/NTokenizers.Tests/HtmlTokenizerTests.cs - 15 test cases
  • tests/NTokenizers.ShowCase.Html/ - Demo project with embedded CSS/JS
  • docs/html.md - Documentation
Original prompt

This section details the original issue you should resolve

<issue_title>Add html tokenizer</issue_title>
<issue_description>Implement a streaming HTML tokenizer in C# that works fully character-by-character with a state machine, and that can stop at an optional stopDelimiter. Reuse code from BaseSubTokenizer.

Public API

The public API is handled by BaseSubTokenizer<TToken>, so make sure to inherit from this class:

public sealed class HtmlTokenizer : BaseSubTokenizer<HtmlToken>

The BaseSubTokenizer contains an abstract method, so make sure to override that one:

internal protected override Task ParseAsync(CancellationToken ct)

There is also already a method available that handles the stopDelimiter logic. So override as follows:

    internal protected override Task ParseAsync(CancellationToken ct)
    {
        /* define additional state variables here */

        TokenizeCharacters(ct, (c) => ProcessChar(c, /* pass extra state variables here */));

        EmitPending(/* pass extra state variables here */);

        return Task.CompletedTask;
    }
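
The shape of that override (state variables, a per-character `ProcessChar`, and a final `EmitPending`) can be sketched in isolation. The following is a minimal standalone illustration, not the NTokenizers implementation: it does not inherit from `BaseSubTokenizer`, a plain `foreach` stands in for `TokenizeCharacters`, and the token type names `Text` and `Punctuation` are assumptions (only `ElementName` comes from the description above). Attributes are deliberately not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical standalone sketch of a character-by-character state machine.
public static class StateMachineSketch
{
    private enum State { Text, TagName }

    public static List<(string Type, string Value)> Tokenize(string input)
    {
        var tokens = new List<(string Type, string Value)>();
        var buffer = new StringBuilder(); // characters of the token in progress
        var state = State.Text;

        // Stand-in for TokenizeCharacters: feed one character at a time.
        foreach (char c in input)
        {
            state = ProcessChar(c, state, buffer, tokens);
        }

        // Flush whatever is still buffered when the input ends.
        EmitPending(state, buffer, tokens);
        return tokens;
    }

    private static State ProcessChar(
        char c, State state, StringBuilder buffer, List<(string Type, string Value)> tokens)
    {
        switch (state)
        {
            case State.Text:
                if (c == '<')
                {
                    EmitPending(state, buffer, tokens);
                    tokens.Add(("Punctuation", "<"));
                    return State.TagName;
                }
                buffer.Append(c);
                return State.Text;

            case State.TagName:
                if (c == '/' && buffer.Length == 0)
                {
                    tokens.Add(("Punctuation", "/"));
                    return State.TagName;
                }
                if (c == '>')
                {
                    EmitPending(state, buffer, tokens);
                    tokens.Add(("Punctuation", ">"));
                    return State.Text;
                }
                buffer.Append(c);
                return State.TagName;

            default:
                throw new InvalidOperationException("Unknown state");
        }
    }

    private static void EmitPending(
        State state, StringBuilder buffer, List<(string Type, string Value)> tokens)
    {
        if (buffer.Length == 0) return;
        tokens.Add((state == State.TagName ? "ElementName" : "Text", buffer.ToString()));
        buffer.Clear();
    }
}
```

Tokens are added as soon as they are complete, so nothing beyond the current token is ever buffered.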

Similarities with XmlTokenizer

The implementation should be very similar to the XmlTokenizer, so mostly copy its token types. There is one large difference, though: HTML can have CSS and JavaScript sections. Example:

<!DOCTYPE html>
<html>
<head>
    <style language="javascript">
        body { font-family: Arial, sans-serif; }
        .container { max-width: 600px; margin: 0 auto; }
        button { padding: 10px 15px; background: #007bff; color: white; border: none; cursor: pointer; }
    </style>
</head>
<body>
    <div class="container">
        <h1>Hello World</h1>
        <button onclick="alert('Clicked!')">Click Me</button>
    </div>
    <script>
        document.addEventListener('DOMContentLoaded', () => {
            console.log('Page loaded');
        });
    </script>
</body>
</html>

So in this example the content of the style element should be passed to the CssTokenizer, with </style> as the stopDelimiter, and the content of the script element should be passed to the TypescriptTokenizer, with </script> as the stopDelimiter.

This same mechanism is already present in the MarkupTokenizer which also has inline sections that are handled by other tokenizers (see ParseCodeInlines()), so be sure to check that out.
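
The routing decision described here could look roughly like the lookup below. This is only a sketch: `DelegationSketch` and `ResolveSubTokenizer` are hypothetical names, and the tokenizers are represented by their names as strings rather than by real NTokenizers instances.

```csharp
using System;

// Hypothetical helper: maps an element name to the sub-tokenizer that should
// handle its content and the stop delimiter that ends the delegated section.
public static class DelegationSketch
{
    public static (string Tokenizer, string StopDelimiter)? ResolveSubTokenizer(string elementName)
    {
        switch (elementName.ToLowerInvariant())
        {
            case "style":
                return ("CssTokenizer", "</style>");
            case "script":
                return ("TypescriptTokenizer", "</script>");
            default:
                // Any other element is tokenized by the HTML tokenizer itself.
                return null;
        }
    }
}
```

Elements that return null simply continue through the regular HTML state machine and are emitted as ElementName.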

You do not have to handle styling or script within attributes, those can just be emitted as AttributeValue, so nothing special there.


Behavior

  1. Read the input character-by-character from a StreamReader.
  2. Recognize and emit tokens directly via onToken as soon as they are fully recognized.
  3. Read ahead only when it is really necessary.
  4. Parse best-effort, no strict validation.

Implementation details

Take XmlTokenizer as an example.

  1. Goal:
  • Fully streaming HTML tokenizer
  • Supports both block-style and flow-style CSS
  • Matches the style of the XML tokenizer in the NTokenizers repo
  • HTML elements such as p or div will simply be emitted as ElementName.

xUnit tests

  • Place tests in tests\NTokenizers.Tests\HtmlTokenizerTests.cs.
  • Test all tokens
  • Assert input and output (the Parse method returns a string)
  • Test that parsing stops correctly at a given stopDelimiter.
  • Test script and style elements
  • Test cancellation token

Show case project

  • Add a show case project just like the xml one.
  • Use html with a style and javascript section

Folder structure

project-root/
│
├── src/
│   └── NTokenizers/
│       └── Html/
│           ├── HtmlToken.cs
│           └── HtmlTokenizer.cs
│
└── tests/
    ├── NTokenizers.ShowCase.Html/
    │   ├── NTokenizers.ShowCase.Html.csproj
    │   └── Program.cs
    └── NTokenizers.Tests/
        └── HtmlTokenizerTests.cs
  • HtmlTokenType.cs : defines HtmlTokenType.
  • HtmlToken.cs : defines HtmlToken.
  • HtmlTokenizer.cs : contains the HtmlTokenizer.Parse method.
  • NTokenizers.ShowCase.Html.csproj : Showcase project.
  • Program.cs : Html showcase program logic.
  • HtmlTokenizerTests.cs : xUnit tests for all token types, keywords, comments, operators, stopDelimiter, etc.

Documentation

  • Add Html to the lists in the README.md in the root (leave the example untouched)
  • Add Html to the Description lists in the NTokenizers.json
  • Add a yaml.md file to the docs folder (for content inspiration take json.md)
  • Add yaml to the _config.yml.

Bonus implementation double nesting

It would be nice if the MarkdownTokenizer could handle HtmlTokenizing (instead of using the XmlTokenizer), which in turn can handle CssTokenizing etc. I have not really thought that through. But if it is possible, it would be nice if this double nesting just works. If not, we will pick that up in a later iteration.
</issue_description>

Comments on the Issue (you are @copilot in this section)

<com...


Copilot AI and others added 2 commits January 4, 2026 14:55
Co-authored-by: juuul <359686+juuul@users.noreply.github.com>
… which delegates to CSS and TypeScript tokenizers

Co-authored-by: juuul <359686+juuul@users.noreply.github.com>
Copilot AI changed the title [WIP] Add streaming Html tokenizer implementation Add streaming HTML tokenizer with CSS and JavaScript delegation Jan 4, 2026
Copilot AI requested a review from juuul January 4, 2026 15:04
Development

Successfully merging this pull request may close these issues.

Add html tokenizer

2 participants