Add streaming HTML tokenizer with CSS and JavaScript delegation #39

Copilot · 2026-01-04T14:46:40Z

Implements a character-by-character HTML tokenizer that delegates <style> and <script> content to specialized tokenizers while maintaining streaming architecture.

Implementation

Core tokenizer: State machine-based HTML parser inheriting from BaseSubTokenizer<HtmlToken>
- Token types mirror XML tokenizer: ElementName, AttributeName, AttributeValue, Comment, DocumentTypeDeclaration, etc.
- Stop delimiter support for partial parsing
Sub-tokenizer delegation:
- <style> content → CssTokenizer with </style> as stop delimiter
- <script> content → TypescriptTokenizer with </script> as stop delimiter
- Closing tags emitted after sub-tokenizer completes
Markdown integration: HTML code blocks now use HtmlTokenizer instead of XmlTokenizer, enabling proper tokenization of embedded CSS/JavaScript
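
The stop-delimiter idea behind the delegation above can be illustrated in isolation. The following is a minimal sketch, not the code from this PR: `StopDelimiterSketch` and `ReadUntil` are hypothetical names, and the case-insensitive match is an assumption made because HTML end tags are case-insensitive. It reads one character at a time and returns everything before the delimiter, leaving the reader positioned just after it.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only; not part of NTokenizers.
public static class StopDelimiterSketch
{
    // Reads characters one at a time and returns the text that precedes
    // stopDelimiter. The delimiter itself is consumed but not returned.
    public static string ReadUntil(TextReader reader, string stopDelimiter)
    {
        var buffer = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1)
        {
            buffer.Append((char)next);
            if (EndsWith(buffer, stopDelimiter))
            {
                buffer.Length -= stopDelimiter.Length; // drop the delimiter
                break;
            }
        }
        return buffer.ToString();
    }

    // Case-insensitive suffix check (an assumption of this sketch).
    private static bool EndsWith(StringBuilder buffer, string suffix)
    {
        if (buffer.Length < suffix.Length) return false;
        int offset = buffer.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (char.ToLowerInvariant(buffer[offset + i]) != char.ToLowerInvariant(suffix[i]))
                return false;
        }
        return true;
    }
}
```

With a reader positioned just after `<style>`, `ReadUntil(reader, "</style>")` yields the CSS text, and the surrounding HTML parsing can continue from the same reader. A real sub-tokenizer would emit tokens while reading instead of buffering the whole section.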

Usage

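using System.Text;

string html = """
<style>
    body { color: red; }
</style>
<script>
    console.log('Hello');
</script>
""";

// Wrap the sample markup in a stream for the tokenizer to read.
using var stream = new MemoryStream(Encoding.UTF8.GetBytes(html));

await HtmlTokenizer.Create().ParseAsync(stream, token => {
    // <style> and <script> content is tokenized by the CSS and TypeScript sub-tokenizers
    Console.WriteLine($"{token.TokenType}: {token.Value}");
});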

This enables syntax highlighting for nested structures like Markdown → HTML → CSS/JavaScript.

Files Added

src/NTokenizers/Html/ - Token types and tokenizer implementation
src/NTokenizers/Markdown/Metadata/HtmlCodeBlockMetadata.cs - Markdown integration
tests/NTokenizers.Tests/HtmlTokenizerTests.cs - 15 test cases
tests/NTokenizers.ShowCase.Html/ - Demo project with embedded CSS/JS
docs/html.md - Documentation

Original prompt

This section details the original issue you should resolve.

<issue_title>Add html tokenizer</issue_title>
<issue_description>Implement a streaming HTML tokenizer in C# that works fully character-by-character with a state machine and can stop at an optional stopDelimiter. Reuse code from BaseSubTokenizer.

Public API

The public API is handled by BaseSubTokenizer<TToken>, so make sure to inherit from this class:

public sealed class HtmlTokenizer : BaseSubTokenizer<HtmlToken>

The BaseSubTokenizer contains an abstract method, so make sure to override that one:

internal protected override Task ParseAsync(CancellationToken ct)

There is also already a method available that handles the stopDelimiter logic. So override as follows:

    internal protected override Task ParseAsync(CancellationToken ct)
    {
        /* define additional state variables here */

        TokenizeCharacters(ct, (c) => ProcessChar(c, /* pass extra state variables here */));

        EmitPending(/* pass extra state variables here */);

        return Task.CompletedTask;
    }
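
To illustrate what a per-character `ProcessChar` and a final `EmitPending` could look like, here is a minimal, self-contained sketch. It does not use BaseSubTokenizer, and every name in it (`TagStateMachineSketch`, `SketchTokenType`, `Flush`) is hypothetical. It only recognizes tag brackets, element names, and text, and it skips attributes entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical token kinds for this sketch; the real HtmlTokenType has more.
public enum SketchTokenType { OpeningAngleBracket, ClosingAngleBracket, Slash, ElementName, Text }

public sealed class TagStateMachineSketch
{
    private enum State { Text, TagName, InsideTag }

    private readonly StringBuilder _pending = new StringBuilder();
    private readonly Action<SketchTokenType, string> _onToken;
    private State _state = State.Text;

    public TagStateMachineSketch(Action<SketchTokenType, string> onToken) => _onToken = onToken;

    // Called once per input character; emits tokens as soon as they are complete.
    public void ProcessChar(char c)
    {
        switch (_state)
        {
            case State.Text when c == '<':
                EmitPending(SketchTokenType.Text);
                _onToken(SketchTokenType.OpeningAngleBracket, "<");
                _state = State.TagName;
                break;
            case State.Text:
                _pending.Append(c);
                break;
            case State.TagName when c == '/':
                EmitPending(SketchTokenType.ElementName); // e.g. the "br" in <br/>
                _onToken(SketchTokenType.Slash, "/");
                break;
            case State.TagName when c == '>':
                EmitPending(SketchTokenType.ElementName);
                _onToken(SketchTokenType.ClosingAngleBracket, ">");
                _state = State.Text;
                break;
            case State.TagName when char.IsWhiteSpace(c):
                EmitPending(SketchTokenType.ElementName);
                _state = State.InsideTag;
                break;
            case State.TagName:
                _pending.Append(c);
                break;
            case State.InsideTag when c == '>':
                _onToken(SketchTokenType.ClosingAngleBracket, ">");
                _state = State.Text;
                break;
            case State.InsideTag:
                break; // attributes are ignored in this sketch
        }
    }

    // Call at end of input so buffered characters are not lost.
    public void Flush() =>
        EmitPending(_state == State.TagName ? SketchTokenType.ElementName : SketchTokenType.Text);

    private void EmitPending(SketchTokenType type)
    {
        if (_pending.Length == 0) return;
        _onToken(type, _pending.ToString());
        _pending.Clear();
    }
}
```

Feeding `<p>Hi</p>` through `ProcessChar` one character at a time and then calling `Flush` produces the sequence `<`, `p`, `>`, `Hi`, `<`, `/`, `p`, `>`. The state lives in fields here for brevity; the skeleton above passes state variables explicitly instead.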

Similarities with XmlTokenizer

The implementation should be very similar to the XmlTokenizer, so mostly copy its token types. There is one large difference, though: HTML can contain CSS and JavaScript sections. Example:

<!DOCTYPE html>
<html>
<head>
    <style language="javascript">
        body { font-family: Arial, sans-serif; }
        .container { max-width: 600px; margin: 0 auto; }
        button { padding: 10px 15px; background: #007bff; color: white; border: none; cursor: pointer; }
    </style>
</head>
<body>
    <div class="container">
        <h1>Hello World</h1>
        <button onclick="alert('Clicked!')">Click Me</button>
    </div>
    <script>
        document.addEventListener('DOMContentLoaded', () => {
            console.log('Page loaded');
        });
    </script>
</body>
</html>

So in this example the content of the style element should be passed to the CssTokenizer, where </style> will be the stopDelimiter and the script element should be passed to the TypescriptTokenizer where </script> will be the stopDelimiter.

This same mechanism is already present in the MarkupTokenizer which also has inline sections that are handled by other tokenizers (see ParseCodeInlines()), so be sure to check that out.

You do not have to handle styling or script within attributes, those can just be emitted as AttributeValue, so nothing special there.

Behavior

Read the input character-by-character from a StreamReader.
Recognize tokens and emit them immediately via onToken as soon as they are fully recognized.
Only read ahead when it is really necessary.
Parse best-effort, no strict validation.

Implementation details

Take XmlTokenizer as an example.

Goal:

Fully streaming html tokenizer
Supports both block-style and flow-style CSS
Matches the style of the XML tokenizer in the NTokenizers repo
html elements such as p or div, etc will just be emitted as ElementName.

xUnit tests

Place tests in tests\NTokenizers.Tests\HtmlTokenizerTests.cs.
Test all tokens
Assert input and output (the Parse method returns a string)
Test that parsing stops correctly at a given stopDelimiter.
Test script and style elements
Test cancellation token

Show case project

Add a show case project just like the xml one.
Use html with a style and javascript section

Folder structure

project-root/
│
├─ src/
│  └─ NTokenizers/
│     └─ Html/
│        ├─ HtmlToken.cs
│        └─ HtmlTokenizer.cs
│
└─ tests/
   ├─ NTokenizers.ShowCase.Html/
   │  ├─ NTokenizers.ShowCase.Html.csproj
   │  └─ Program.cs
   └─ NTokenizers.Tests/
      └─ HtmlTokenizerTests.cs

HtmlTokenType.cs : defines HtmlTokenType.
HtmlToken.cs : defines HtmlToken.
HtmlTokenizer.cs : contains the HtmlTokenizer.Parse method.
NTokenizers.ShowCase.Html.csproj : Showcase project.
Program.cs : Html showcase program logic.
HtmlTokenizerTests.cs : xUnit tests for all token types, keywords, comments, operators, stopDelimiter, etc.

Documentation

Add Html to the lists in the root README.md (leave the example untouched)
Add Html to the Description lists in NTokenizers.json
Add a yaml.md file to the docs folder (for content inspiration take json.md)
Add yaml to the _config.yml.

Bonus implementation double nesting

It would be nice if the MarkdownTokenizer could use the HtmlTokenizer (instead of the XmlTokenizer), which in turn handles CSS tokenizing, etc. I have not really thought that through. But if it is possible, it would be nice if this double nesting just works. If not, we will pick that up in a later iteration.
</issue_description>

Comments on the Issue (you are @copilot in this section)

<com...

Fixes Add html tokenizer #38

Co-authored-by: juuul <359686+juuul@users.noreply.github.com>

Initial plan

c4941fe

Copilot AI assigned Copilot and juuul Jan 4, 2026

Copilot started work on behalf of juuul January 4, 2026 14:47

Copilot AI and others added 2 commits January 4, 2026 14:55

Implement HTML tokenizer with CSS and JavaScript support

9f36d10

Co-authored-by: juuul <359686+juuul@users.noreply.github.com>

Add bonus double nesting support - Markdown can now use HtmlTokenizer…

c1f478e

… which delegates to CSS and TypeScript tokenizers Co-authored-by: juuul <359686+juuul@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add streaming Html tokenizer implementation~~ Add streaming HTML tokenizer with CSS and JavaScript delegation Jan 4, 2026

Copilot AI requested a review from juuul January 4, 2026 15:04

Copilot finished work on behalf of juuul January 4, 2026 15:04
