Add streaming HTML tokenizer with CSS and JavaScript delegation #39
Implements a character-by-character HTML tokenizer that delegates `<style>` and `<script>` content to specialized tokenizers while maintaining the streaming architecture.
Implementation
- Core tokenizer: State machine-based HTML parser inheriting from `BaseSubTokenizer<HtmlToken>`
- Token types: `ElementName`, `AttributeName`, `AttributeValue`, `Comment`, `DocumentTypeDeclaration`, etc.
- Sub-tokenizer delegation:
  - `<style>` content → `CssTokenizer` with `</style>` as stop delimiter
  - `<script>` content → `TypescriptTokenizer` with `</script>` as stop delimiter
- Markdown integration: HTML code blocks now use `HtmlTokenizer` instead of `XmlTokenizer`, enabling proper tokenization of embedded CSS/JavaScript
Usage
This enables syntax highlighting for nested structures like Markdown → HTML → CSS/JavaScript.
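As a rough illustration of the delegation boundaries described above, the following self-contained sketch splits an HTML string into segments that would go to the HTML, CSS, or script tokenizer. It is not the NTokenizers API: `SegmentKind`, `DelegationSketch`, and `Split` are hypothetical names, the sketch works on a complete string with `IndexOf` rather than a streaming state machine, and it assumes opening tags without attributes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names throughout; this is not the NTokenizers API.
public enum SegmentKind { Html, Css, Script }

public static class DelegationSketch
{
    // Opening tag -> (kind of delegated content, stop delimiter that ends it).
    // Assumption: opening tags carry no attributes.
    private static readonly Dictionary<string, (SegmentKind Kind, string Stop)> Delegates = new()
    {
        ["<style>"] = (SegmentKind.Css, "</style>"),
        ["<script>"] = (SegmentKind.Script, "</script>"),
    };

    public static List<(SegmentKind Kind, string Text)> Split(string html)
    {
        var segments = new List<(SegmentKind Kind, string Text)>();
        int pos = 0;
        while (pos < html.Length)
        {
            // Find the nearest opening tag that triggers delegation.
            int openAt = -1;
            string openTag = "";
            foreach (var tag in Delegates.Keys)
            {
                int i = html.IndexOf(tag, pos, StringComparison.OrdinalIgnoreCase);
                if (i >= 0 && (openAt < 0 || i < openAt)) { openAt = i; openTag = tag; }
            }

            if (openAt < 0)
            {
                // No more delegated sections: the rest is plain HTML.
                segments.Add((SegmentKind.Html, html.Substring(pos)));
                break;
            }

            var (kind, stop) = Delegates[openTag];
            int contentStart = openAt + openTag.Length;

            // Everything up to and including the opening tag stays with the HTML tokenizer.
            segments.Add((SegmentKind.Html, html.Substring(pos, contentStart - pos)));

            // The content up to the stop delimiter is what a sub-tokenizer would receive.
            int stopAt = html.IndexOf(stop, contentStart, StringComparison.OrdinalIgnoreCase);
            if (stopAt < 0) stopAt = html.Length; // unterminated: delegate the remainder
            segments.Add((kind, html.Substring(contentStart, stopAt - contentStart)));

            pos = stopAt; // the stop delimiter itself is handled as HTML again
        }
        return segments;
    }
}
```

Calling `DelegationSketch.Split("<p>a</p><style>b{}</style><script>let x;</script>")` yields five segments, alternating between HTML and the delegated CSS and script content.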
Files Added
- `src/NTokenizers/Html/` - Token types and tokenizer implementation
- `src/NTokenizers/Markdown/Metadata/HtmlCodeBlockMetadata.cs` - Markdown integration
- `tests/NTokenizers.Tests/HtmlTokenizerTests.cs` - 15 test cases
- `tests/NTokenizers.ShowCase.Html/` - Demo project with embedded CSS/JS
- `docs/html.md` - Documentation
Original prompt
This section details the original issue you should resolve
<issue_title>Add html tokenizer</issue_title>
<issue_description>Implement a streaming HTML tokenizer in C# that works fully character by character with a state machine and that can stop at an optional `stopDelimiter`. Reuse code from `BaseSubTokenizer`.
Public API
The public API is handled by `BaseSubTokenizer<TToken>`, so make sure to inherit from this class:
The `BaseSubTokenizer` contains an abstract method, so make sure to override that one:
There is also already a method available that handles the `stopDelimiter` logic. So override as follows:
Similarities with XmlTokenizer
The implementation should be very similar to the XmlTokenizer, so mainly copy its token types as well. There is one large difference, though: HTML can have CSS and JavaScript sections. Example:
So in this example the content of the style element should be passed to the CssTokenizer, where `</style>` will be the stopDelimiter, and the script element should be passed to the TypescriptTokenizer, where `</script>` will be the stopDelimiter.
This same mechanism is already present in the `MarkupTokenizer`, which also has inline sections that are handled by other tokenizers (see ParseCodeInlines()), so be sure to check that out.
You do not have to handle styling or script within attributes; those can just be emitted as AttributeValue, so nothing special there.
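To make the stop-delimiter idea concrete, here is a minimal sketch of reading characters one at a time until a delimiter such as `</style>` has been seen. `StopDelimiterSketch` and `ReadUntil` are hypothetical names and not the `BaseSubTokenizer` method mentioned above. The sketch assumes the delimiter is consumed from the reader and matched case-insensitively. A real streaming tokenizer would emit tokens as it goes instead of buffering the whole content.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper; not part of NTokenizers.
public static class StopDelimiterSketch
{
    // Reads one character at a time until stopDelimiter has been seen.
    // Returns the text before the delimiter; the delimiter itself is consumed.
    // If the delimiter never appears, everything up to end of input is returned.
    public static string ReadUntil(TextReader reader, string stopDelimiter)
    {
        var buffer = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1)
        {
            buffer.Append((char)next);
            if (EndsWith(buffer, stopDelimiter))
            {
                // Drop the delimiter from the collected content and stop.
                buffer.Length -= stopDelimiter.Length;
                break;
            }
        }
        return buffer.ToString();
    }

    // Case-insensitive check whether the buffer currently ends with the suffix.
    private static bool EndsWith(StringBuilder sb, string suffix)
    {
        if (suffix.Length == 0 || sb.Length < suffix.Length) return false;
        int offset = sb.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (char.ToLowerInvariant(sb[offset + i]) != char.ToLowerInvariant(suffix[i]))
                return false;
        }
        return true;
    }
}
```

With a `StringReader` over `a{color:red}</style><p>`, `ReadUntil(reader, "</style>")` returns the CSS text, and the reader is left positioned at `<p>`, so an outer HTML tokenizer could continue from there.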
Behavior
- `StreamReader`.
- `onToken` as soon as they are fully recognized.
Implementation details
- Take `XmlTokenizer` as an example.
- `NTokenizers` repo
- `p` or `div`, etc. will just be emitted as ElementName.
xUnit tests
- `tests\NTokenizers.Tests\HtmlTokenizerTests.cs`.
- `stopDelimiter`.
Showcase project
Folder structure
- `HtmlTokenType.cs`: defines `HtmlTokenType`.
- `HtmlToken.cs`: defines `HtmlToken`.
- `HtmlTokenizer.cs`: contains the `HtmlTokenizer.Parse` method.
- `NTokenizers.ShowCase.Html.csproj`: Showcase project.
- `Program.cs`: HTML showcase program logic.
- `HtmlTokenizerTests.cs`: xUnit tests for all token types, keywords, comments, operators, stopDelimiter, etc.
Documentation
Bonus implementation double nesting
It would be nice if the MarkdownTokenizer could handle HTML code blocks with the HtmlTokenizer (instead of the XmlTokenizer), which in turn can delegate to the CssTokenizer, etc. I have not really thought that through. But if it is possible, it would be nice if this double nesting just works. If not, we will pick that up in a later iteration.
</issue_description>
Comments on the Issue (you are @copilot in this section)
<com...