
feat: Custom Tokenizer/Renderer extensions #2043

Merged
merged 49 commits into markedjs:master on Jun 15, 2021

Conversation

calculuschild
Contributor

@calculuschild calculuschild commented May 8, 2021

Description

New attempt at custom tokenizers and renderers for Marked.js. Faster than #1872.

May be a possible fix for: #1373, #1693, #1695, #2061.

Users can add custom tokenizers and renderers that are executed without needing to overwrite the existing components, by supplying an extension in the following format:

const myExtension = {
  extensions: [{
    name: 'underline',
    start(src) { return src.match(/:/)?.index; },  // First characters of your token so Marked.js knows to stop and check for a match
    level: 'block',        // Is this a block-level or inline-level tokenizer?
    tokenizer(src, tokens) {
      const rule = /^:([^\n]*)(?:\n|$)/;  // Regex for the complete token
      const match = rule.exec(src);
      if (match) {
        return {                          // Token to generate
          type: 'underline',                // Should match "name" above
          raw: match[0],                    // The text that you want your token to consume from the source
          text: match[1].trim()             // Any custom properties you want the Renderer to access
        };
      }
    },
    renderer(token) {
      return `<u>${token.text}</u>\n`;
    }
  }]
};

The extension(s) can then be loaded like so:

marked.use(myExtension, extension2, extension3);

// EQUIVALENT TO:

marked.use(myExtension);
marked.use(extension2);
marked.use(extension3);
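To see what the extension actually does end to end, the sketch below applies its tokenizer and renderer directly to a sample source string. This is a standalone illustration of the tokenize-then-render flow only, not marked's internal dispatch; `underlineExt` is just the example extension above with the `extensions` wrapper removed.

```javascript
// The underline extension from above, applied by hand to a sample source.
const underlineExt = {
  name: 'underline',
  level: 'block',
  start(src) { return src.match(/:/)?.index; }, // where the Lexer should stop and check
  tokenizer(src) {
    const rule = /^:([^\n]*)(?:\n|$)/;          // regex for the complete token
    const match = rule.exec(src);
    if (match) {
      return { type: 'underline', raw: match[0], text: match[1].trim() };
    }
  },
  renderer(token) { return `<u>${token.text}</u>\n`; }
};

// Tokenize, then render, exactly as the Lexer and Parser would call these hooks.
const token = underlineExt.tokenizer(':underline me\nmore text');
const html = underlineExt.renderer(token);
console.log(html); // <u>underline me</u>
```

Note that `raw` (`':underline me\n'`) is what the Lexer consumes from the source, while `text` is the custom property the renderer reads.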

Benchmarks (on my laptop):

Master from Feb 2021
es5 marked completed in 4260ms and passed 82.74%
es6 marked completed in 4289ms and passed 82.74%
es5 marked (gfm) completed in 4518ms and passed 82.13%
es6 marked (gfm) completed in 4504ms and passed 82.13%
es5 marked (pedantic) completed in 4596ms and passed 61.48%
es6 marked (pedantic) completed in 4783ms and passed 61.48%
commonmark completed in 3617ms and passed 100.00%
markdown-it completed in 3646ms and passed 89.21%
Master (current)
es5 marked completed in 4504ms and passed 86.90%
es6 marked completed in 4622ms and passed 86.90%
es5 marked (gfm) completed in 4776ms and passed 86.29%
es6 marked (gfm) completed in 5029ms and passed 86.29%
es5 marked (pedantic) completed in 4674ms and passed 71.03%
es6 marked (pedantic) completed in 5007ms and passed 71.03%
commonmark completed in 3678ms and passed 100.00%
markdown-it completed in 3672ms and passed 89.21%
This PR
es5 marked completed in 4761ms and passed 86.90%
es6 marked completed in 4814ms and passed 86.90%
es5 marked (gfm) completed in 5115ms and passed 86.29%
es6 marked (gfm) completed in 5105ms and passed 86.29%
es5 marked (pedantic) completed in 4972ms and passed 71.03%
es6 marked (pedantic) completed in 5069ms and passed 71.03%
commonmark completed in 3619ms and passed 100.00%
markdown-it completed in 3735ms and passed 89.21%
This PR with all extensions running at the top of the Lexer rather than using "before" to spread them out:
es5 marked completed in 4591ms and passed 86.90%
es6 marked completed in 4644ms and passed 86.90%
es5 marked (gfm) completed in 4891ms and passed 86.29%
es6 marked (gfm) completed in 4804ms and passed 86.29%
es5 marked (pedantic) completed in 4770ms and passed 71.03%
es6 marked (pedantic) completed in 4799ms and passed 71.03%
commonmark completed in 3617ms and passed 100.00%
markdown-it completed in 3587ms and passed 89.21%

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

@vercel

vercel bot commented May 8, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/GBXN7V5EfC83A6yEQUUynLXjXD2h
✅ Preview: https://markedjs-git-fork-calculuschild-markedextensions-markedjs.vercel.app

@UziTech
Member

UziTech commented May 8, 2021

We should definitely only run through the extensions once and create a map like:

this.extensions = {
  paragraph: [
    underline
  ]
}

Then looking them up like:

runTokenizerExtension(src, tokens, before) {
  let tokensLength = 0;
  if (this.extensions[before]) {
    this.extensions[before].forEach((extension) => {
      const token = extension.tokenizer(src);
      if (token) {
        src = src.substring(token.raw.length);
        tokens.push(token);
        tokensLength += token.raw.length;
      }
    });
  }
  return tokensLength;
}
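Building that lookup map once at registration time is the key idea: extensions are bucketed by their `before` key so dispatch is a single object lookup instead of a scan. A standalone sketch of the registration side (the helper name and the `'last'` default bucket are hypothetical here, just for illustration):

```javascript
// Group extensions by their "before" key once, at registration time.
// Extensions without a "before" go into a default bucket.
function buildExtensionMap(extList) {
  const map = {};
  for (const ext of extList) {
    const key = ext.before || 'last';       // default bucket when no "before" given
    (map[key] = map[key] || []).push(ext);
  }
  return map;
}

const extMap = buildExtensionMap([
  { name: 'underline', before: 'paragraph' },
  { name: 'spoiler', before: 'paragraph' },
  { name: 'footnote' }                       // no "before": runs last
]);
// extMap.paragraph holds two extensions; extMap.last holds one
```

The Lexer then only pays for the buckets that exist: `this.extensions[before]` is either an array to run or `undefined` to skip.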

@calculuschild
Contributor Author

calculuschild commented May 8, 2021

Used a similar method to your inlineText tokenizer on the paragraph tokenizer. Now if there are extensions, it looks for the "start" regex and cuts src down to that index before continuing on.

Although... this would make customizing the paragraph tokenizer much more difficult. Maybe this logic can go into the Lexer so the paragraph tokenizer remains easily editable.

@calculuschild
Contributor Author

calculuschild commented May 8, 2021

@UziTech Is there a way to clear the extensions/options between unit tests?

Edit: Let's see... I've found marked.defaults = marked.getDefaults(); Is that the best way to do it?

Edit2: Bah. For some reason that causes tests in Parser-spec.js to fail in very strange ways even though I'm not touching that file at all. Text from the marked-spec.js is showing up in the results there for some reason? I wonder if the async tests are to blame. Commented out for now...

@calculuschild
Contributor Author

calculuschild commented May 9, 2021

Added an "inline" version of my underline token and the extension works.

I was looking at the header-ids extension but I'm not sure how to best approach that. I see a few options:

  1. Overwrite the Heading renderer (can already do this)
  2. Overwrite the Heading tokenizer (can already do this)
  3. Insert a custom Block HeadingID extension which runs before the normal Heading tokenizer (simple enough with this PR)
  4. Have a custom Inline "ID" extension that we could apply more generally to whatever the parent token is (to add an id to an image, to a codespan, etc.) (not possible with this PR yet)

So this brings up some thoughts:

  • Do we want to change the syntax for overwriting existing renderers/tokenizers to somehow follow this new style? (i.e., an optional parameter overwrite : 'heading' instead of before : 'heading'.)
  • I would be interested in getting something like 4) above to work. There are existing Markdown extensions that add a snippet of text inside another block element to give it a particular style or HTML property, such as image sizing, or Pandoc's fenced divs (an example of reusing the {#id} syntax from the header-ids extension in different block types). It would be neat if that didn't require fully overwriting each block element you want to apply the snippet to. This would probably require passing the previous token or parent token in as a parameter to the custom tokenizer, but I'm not sure if that would be a good thing.
  • Are there any extra parameters that we obviously should pass in that our other tokenizers have needed? PrevChar, etc?

@UziTech
Member

UziTech commented May 10, 2021

Is there a way to clear the extensions/options between unit tests?

You should add extensions: null to the defaults and it should be reset between tests.

@UziTech
Member

UziTech commented May 10, 2021

Do we want to change the syntax for overwriting existing renderers/tokenizers to somehow follow this new style? (i.e., an optional parameter overwrite : 'heading' instead of before : 'heading'.)

Good question. I'm not sure if it is better to have multiple ways to do the same thing. I think it would depend on whether it would slow down marked without extensions too much.

@calculuschild
Contributor Author

it would depend on whether it would slow down marked without extensions too much.

I was imagining just a change in the Marked.js page, to handle a different extension format but we would parse it into the same marked.defaults.renderer or whatever so speed should be similar when executing.

I would probably be in favor of just having one way to do things, and I also like the idea of having consistency between how extensions are formatted.

@UziTech
Member

UziTech commented May 10, 2021

Sounds good. We could add this way of extensions and deprecate the old way and remove it in v3 or v4

@UziTech
Member

UziTech commented May 10, 2021

I would be interested in getting something like 4) above to work.

We could pass the parent token or something if it is not a top level token so the tokenizer or renderer could do different things depending on what type of token it is in.

@calculuschild
Contributor Author

calculuschild commented May 10, 2021

We could add this way of extensions and deprecate the old way and remove it in v3 or v4

Ok, so if we do this and require each extension to be formatted as an object, what about supporting either a single extension object or an array of extensions (rather than a nested object of extensions), so order is preserved for consistency? Like so:

extension1 = {
  name: 'extension1',
  overwrite: 'header',
  level: 'block',
  tokenizer: (src) => { /* ... */ }
};

extension2 = {
  name: 'extension2',
  before: 'paragraph',
  level: 'block',
  tokenizer: (src) => { /* ... */ }
};

marked.use(extension1);
marked.use(extension2);
// EQUIVALENT TO
marked.use([extension1, extension2]);

If an extension has "overwrite", treat it as we currently treat extensions, i.e., merge into marked.defaults.

Otherwise, merge into an extensions map depending on the contents of before.

@calculuschild
Contributor Author

We should definitely only run through the extensions once and create a map like:

@UziTech I applied all the changes discussed above. Processing in marked.js got a bit ugly (and I'm probably missing special cases/error handling, which I am not experienced with) but the code elsewhere is much simplified now without the looping and results in an extensions object of the following format:

extensions {
  [beforeA]: [tokenizer1, tokenizer2],
  [beforeB]: [tokenizer3],
  'last': [tokenizer4],
  [nameA]: renderer1,
  [nameB]: renderer2,
  startBlock: /regex1|regex2|regex3/,
  startInline: /regex4|regex5/
}
Overwriting extensions now require the property overwrite: true, and they will be merged into defaults as before.

Custom extensions with tokenizers are added under the before key 'last' by default and run after all of the default tokenizers, unless the user supplies a valid before property. Inline tokenizers need a valid start value to work properly; block tokenizers only need start if they should be able to interrupt paragraphs, a la tables, headings, etc.

Custom extensions with Renderers must have a name property.

start regexes are now merged into one to avoid looping and to ensure the first match index is chosen.
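A hypothetical sketch of that merging (the helper name and sample patterns are illustrative, not marked's actual code): combining every extension's start pattern into one alternation means a single exec() call yields the earliest index at which any extension could match.

```javascript
// Merge per-extension "start" patterns into one alternation so one exec()
// finds the earliest position where any custom extension might match.
function mergeStartRegexes(sources) {
  return new RegExp(sources.map(s => `(?:${s})`).join('|'));
}

// e.g. an underline extension starting at ':' and a footnote starting at '^['
const startBlock = mergeStartRegexes([':', '\\^\\[']);

const src = 'plain text then :underline:';
const index = startBlock.exec(src)?.index;
console.log(index); // 16
```

Because regex alternation scans left to right through the source, the combined pattern's match index is automatically the minimum over all extensions, with no per-extension loop.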

@calculuschild
Contributor Author

calculuschild commented May 11, 2021

Added current benchmarks to the OP. Aw man... Not sure where the slowdown is coming from yet.

Edit: Ah... Most of it is because I was accidentally comparing to my (old, but faster) fork of the Master branch. I updated the original post comparing the four scenarios.

@calculuschild
Contributor Author

calculuschild commented May 11, 2021

Now that I think about it.... Do we NEED a before parameter? What if we just have all the custom extensions run one after another, very first thing, to cut down on the number of calls we need sprinkled throughout the Lexer? I can't actually think of a case where a user would want to add a custom tokenizer that specifically runs after the others. We must have had at least one example of it before but I can't find it anywhere.

I have this working on my local branch and it actually removes most of the slowdown. If that sounds reasonable I can commit it.

@UziTech
Member

UziTech commented May 12, 2021

Ya that sounds fine. If we need to allow running between other tokenizers we can add that later.

@UziTech
Member

UziTech commented Jun 1, 2021

What do you think about passing in something like a previousToken and nextToken to walkTokens?

I have thought about adding all of those things. The problem is: where do you stop? it could also be helpful to know what the parent token is ... and the grandparent token ... and all of the siblings.

I think the best thing to do is say walkTokens is simply a convenience function for walking each token and altering the tokens just based on that token. If someone wants to do more they can get the tokens and walk them however they want manually.

@calculuschild
Contributor Author

calculuschild commented Jun 1, 2021

they can get the tokens and walk them however they want manually.

How would a user go about doing that? Just accessing the Lexer?

The problem is: where do you stop?

Right... Ok, what if we do this instead: after a quick google, apparently we could get the index of the current token being walked with for [index, token] of tokens.entries(). Then we could pass in index and tokens, which would give access to all siblings, and you know your position within the siblings. Children are already accessible. And parents/grandparents could be accessed by just walking them and checking if they contain the child token of interest, then manipulating that child. I think that gives access to anything in the token tree. You would just have to write your walkTokens function to look sideways or down.
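The entries()-based sibling walk described above can be sketched as follows (standalone, with hypothetical sample tokens; not marked's walkTokens itself):

```javascript
// Walking sibling tokens with their index via Array.prototype.entries(),
// so the walker knows its position among siblings and can look sideways.
const tokens = [
  { type: 'paragraph', text: 'a' },
  { type: 'table', text: 'b' },
  { type: 'paragraph', text: 'c' }
];

for (const [index, token] of tokens.entries()) {
  const prev = tokens[index - 1]; // undefined at the start of the list
  const next = tokens[index + 1]; // undefined at the end of the list
  if (prev?.type === 'paragraph' && next?.type === 'paragraph') {
    token.text += ' (between two paragraphs)'; // only the table matches here
  }
}
```

Passing index and tokens together gives each visit access to every sibling, at the cost of widening the walkTokens callback signature.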

@UziTech
Member

UziTech commented Jun 1, 2021

How would a user go about doing that? Just accessing the Lexer?

yes

parents/grandparents could be accessed by just walking them and checking if they contain the child token of interest

How is that easier than just accessing tokens from the lexer and walking them manually?

If you want extensions to be able to get tokens we should do something like hooks and just have a hook that gives the extension all tokens between lex and parse and let them do whatever they want. Same with a hook for pre and post processing.

For the gfm-heading-id extension it would be nice to be able to reset the slugger in a preprocess hook instead of needing to call it manually.

@calculuschild
Contributor Author

calculuschild commented Jun 1, 2021

How is that easier than just accessing tokens from the lexer and walking them manually?

Because then the extension creator essentially has to re-write his own custom walkTokens logic and break apart the Marked pipeline to manually call the lexer, then run his custom walkTokens, then call the parser, instead of just using marked('my Markdown'). This is a headache if anyone else actually wants to use that extension, instead of just plugging into marked.use() and it "just works".

The preprocessor/postprocessor hooks would make it work with marked.use() so you don't have to break apart the pipeline, but it still requires users to build their own walkTokens from scratch which seems redundant given that we already have walkTokens.

Ok, so another approach: What if inside of walkTokens we inject two properties into the token immediately before calling the extension functions. A previousToken and a parentToken. Then each token can navigate up (token.parentToken.parentToken.parentToken), down (token.tokens.tokens), or backward to previous siblings (token.previousToken.previousToken). You could even get away with just the previousToken. And you just need to handle it like CSS where your selectors can only access the next element or child elements in the list, but in reverse (you can only access previous or child items, since next siblings wouldn't have been given their previousToken yet.)

So if you want to change a token between two paragraphs:

tokens = [
  token1, // { type: 'paragraph', ... }
  token2, // { type: 'table', previousToken: token1, ... }
  token3  // { type: 'paragraph', previousToken: token2, ... }
];

// walkTokens
walkTokens(token) {
  if (token.type === 'paragraph' && token.previousToken?.previousToken?.type === 'paragraph') {
    token.previousToken.text += `I'm between two paragraphs!`;
  }
}

Essentially the only limitation with walkTokens as it is now, is accessing siblings. You can already do everything you need to do with parent/grandparent/child by just starting your function at the parent or grandparent and looking down the children.
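The previousToken idea above can be made concrete with a small standalone sketch. The helper name is hypothetical and it only walks a flat sibling list (it does not recurse into children, which is exactly the limitation raised in the next comment):

```javascript
// Inject previousToken into each token immediately before visiting it,
// so the callback can look backward through earlier siblings.
function walkWithPrevious(tokens, callback) {
  let prev = null;
  for (const token of tokens) {
    token.previousToken = prev;
    callback(token);
    prev = token;
  }
}

const toks = [
  { type: 'paragraph', text: 'one' },
  { type: 'table', text: 'data' },
  { type: 'paragraph', text: 'two' }
];

walkWithPrevious(toks, (token) => {
  // CSS-selector-style in reverse: only previous siblings are visible,
  // since later siblings haven't been given their previousToken yet.
  if (token.type === 'paragraph' &&
      token.previousToken?.previousToken?.type === 'paragraph') {
    token.previousToken.text += " (between two paragraphs)";
  }
});
// toks[1].text is now "data (between two paragraphs)"
```

One caveat with attaching previousToken directly to tokens: it makes the token tree cyclic-ish for serialization (JSON.stringify would need the links stripped), which may be another reason to prefer passing index and siblings as arguments instead.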

@UziTech
Member

UziTech commented Jun 1, 2021

Except down isn't always in tokens which means they would have to re-create the logic of walkTokens anyway. Again I don't see this being easier than walking manually. Also I think this PR has done enough for now. We should get this merged and improve it from there.

@calculuschild
Contributor Author

Also I think this PR has done enough for now.

Fair enough. I'll work on that last change.

Member

@UziTech UziTech left a comment


nice work. 👍 This is going to allow so many new things for marked.

@calculuschild
Contributor Author

@styfle , @joshbruce , @davisjam

Have you guys had a chance to look at this PR? I know it's a hefty one so we could sure use some extra eyes on it!

@joshbruce
Member

I'd be lying if I said I understood it all. I appreciate that performance isn't hit too hard given this is something the community has been wanting directly (or by proxy) for a while now. It seems like at least a couple of members from the community appreciate it as well. So, I'm good.

One thing I'd like to put a pin in is the use of null - just saw it in a new line for this PR - probably something for later and not sure how we could remove the reference...kind of a personal drive I'm starting to get with my coding style.

@UziTech UziTech changed the title Custom Tokenizer/Renderer extensions feat: Custom Tokenizer/Renderer extensions Jun 15, 2021
@UziTech UziTech merged commit 5be9d6d into markedjs:master Jun 15, 2021
github-actions bot pushed a commit that referenced this pull request Jun 15, 2021
# [2.1.0](v2.0.7...v2.1.0) (2021-06-15)

### Features

* Custom Tokenizer/Renderer extensions ([#2043](#2043)) ([5be9d6d](5be9d6d))
@github-actions

🎉 This PR is included in version 2.1.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

return ret;
};
if (ext.renderer) { // Renderer extensions
const prevRenderer = extensions.renderers?.[ext.name];


@UziTech @calculuschild I think this line is causing the error below. Will create a separate issue too.

node_modules/protobufjs/node_modules/marked/src/marked.js:158
const prevRenderer = extensions.renderers?.[ext.name];
                                          ^
SyntaxError: Unexpected token '.'


Issue here: #2107
