
feat: Custom Tokenizer/Renderer extensions #2043

Merged
merged 49 commits into markedjs:master on Jun 15, 2021

Conversation

calculuschild
Contributor

@calculuschild calculuschild commented May 8, 2021

Description

New attempt at custom tokenizers and renderers for Marked.js. Faster than #1872.

May be a possible fix for: #1373, #1693, #1695, #2061.

Users can add custom tokenizers and renderers that are executed without needing to overwrite the existing components, by supplying an extension in the following format:

const myExtension = {
  extensions: [{
    name: 'underline',
    start(src) { return src.match(/:/)?.index; },  // First characters of your token so Marked.js knows to stop and check for a match
    level: 'block',        // Is this a block-level or inline-level tokenizer?
    tokenizer(src, tokens) {
      const rule = /^:([^\n]*)(?:\n|$)/;  // Regex for the complete token
      const match = rule.exec(src);
      if (match) {
        return {                          // Token to generate
          type: 'underline',                // Should match "name" above
          raw: match[0],                    // The text that you want your token to consume from the source
          text: match[1].trim()             // Any custom properties you want the Renderer to access
        };
      }
    },
    renderer(token) {
      return `<u>${token.text}</u>\n`;
    }
  }]
};

The extension(s) can then be loaded like so:

marked.use(myExtension, extension2, extension3);

// EQUIVALENT TO:

marked.use(myExtension);
marked.use(extension2);
marked.use(extension3);
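To see what the extension actually does end to end, the sketch below applies its tokenizer and renderer directly to a sample source string. This is a standalone illustration of the tokenize-then-render flow only, not marked's internal dispatch; `underlineExt` is just the example extension above with the `extensions` wrapper removed.

```javascript
// The underline extension from above, applied by hand to a sample source.
const underlineExt = {
  name: 'underline',
  level: 'block',
  start(src) { return src.match(/:/)?.index; }, // where the Lexer should stop and check
  tokenizer(src) {
    const rule = /^:([^\n]*)(?:\n|$)/;          // regex for the complete token
    const match = rule.exec(src);
    if (match) {
      return { type: 'underline', raw: match[0], text: match[1].trim() };
    }
  },
  renderer(token) { return `<u>${token.text}</u>\n`; }
};

// Tokenize, then render, exactly as the Lexer and Parser would call these hooks.
const token = underlineExt.tokenizer(':underline me\nmore text');
const html = underlineExt.renderer(token);
console.log(html); // <u>underline me</u>
```

Note that `raw` (`':underline me\n'`) is what the Lexer consumes from the source, while `text` is the custom property the renderer reads.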

Benchmarks (on my laptop):

Master from Feb 2021
es5 marked completed in 4260ms and passed 82.74%
es6 marked completed in 4289ms and passed 82.74%
es5 marked (gfm) completed in 4518ms and passed 82.13%
es6 marked (gfm) completed in 4504ms and passed 82.13%
es5 marked (pedantic) completed in 4596ms and passed 61.48%
es6 marked (pedantic) completed in 4783ms and passed 61.48%
commonmark completed in 3617ms and passed 100.00%
markdown-it completed in 3646ms and passed 89.21%
Master (current)
es5 marked completed in 4504ms and passed 86.90%
es6 marked completed in 4622ms and passed 86.90%
es5 marked (gfm) completed in 4776ms and passed 86.29%
es6 marked (gfm) completed in 5029ms and passed 86.29%
es5 marked (pedantic) completed in 4674ms and passed 71.03%
es6 marked (pedantic) completed in 5007ms and passed 71.03%
commonmark completed in 3678ms and passed 100.00%
markdown-it completed in 3672ms and passed 89.21%
This PR
es5 marked completed in 4761ms and passed 86.90%
es6 marked completed in 4814ms and passed 86.90%
es5 marked (gfm) completed in 5115ms and passed 86.29%
es6 marked (gfm) completed in 5105ms and passed 86.29%
es5 marked (pedantic) completed in 4972ms and passed 71.03%
es6 marked (pedantic) completed in 5069ms and passed 71.03%
commonmark completed in 3619ms and passed 100.00%
markdown-it completed in 3735ms and passed 89.21%
This PR with all extensions running at the top of the Lexer rather than using "before" to spread them out:
es5 marked completed in 4591ms and passed 86.90%
es6 marked completed in 4644ms and passed 86.90%
es5 marked (gfm) completed in 4891ms and passed 86.29%
es6 marked (gfm) completed in 4804ms and passed 86.29%
es5 marked (pedantic) completed in 4770ms and passed 71.03%
es6 marked (pedantic) completed in 4799ms and passed 71.03%
commonmark completed in 3617ms and passed 100.00%
markdown-it completed in 3587ms and passed 89.21%

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

@vercel

vercel bot commented May 8, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/GBXN7V5EfC83A6yEQUUynLXjXD2h
✅ Preview: https://markedjs-git-fork-calculuschild-markedextensions-markedjs.vercel.app

@UziTech
Member

UziTech commented May 8, 2021

We should definitely only run through the extensions once and create a map like:

this.extensions = {
  paragraph: [
    underline
  ]
}

Then looking them up like:

runTokenizerExtension(src, tokens, before) {
  let tokensLength = 0;
  if (this.extensions[before]) {
    this.extensions[before].forEach((extension) => {
      const token = extension.tokenizer(src);
      if (token) {
        src = src.substring(token.raw.length);
        tokens.push(token);
        tokensLength += token.raw.length;
      }
    });
  }
  return tokensLength;
}
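Building that lookup map once at registration time is the key idea: extensions are bucketed by their `before` key so dispatch is a single object lookup instead of a scan. A standalone sketch of the registration side (the helper name and the `'last'` default bucket are hypothetical here, just for illustration):

```javascript
// Group extensions by their "before" key once, at registration time.
// Extensions without a "before" go into a default bucket.
function buildExtensionMap(extList) {
  const map = {};
  for (const ext of extList) {
    const key = ext.before || 'last';       // default bucket when no "before" given
    (map[key] = map[key] || []).push(ext);
  }
  return map;
}

const extMap = buildExtensionMap([
  { name: 'underline', before: 'paragraph' },
  { name: 'spoiler', before: 'paragraph' },
  { name: 'footnote' }                       // no "before": runs last
]);
// extMap.paragraph holds two extensions; extMap.last holds one
```

The Lexer then only pays for the buckets that exist: `this.extensions[before]` is either an array to run or `undefined` to skip.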

@calculuschild
Contributor Author

calculuschild commented May 8, 2021

Used a similar method to your inlineText tokenizer on the paragraph tokenizer. Now if there are extensions, it looks for the "start" regex and cuts src down to that index before continuing on.

Although... this would make customizing the paragraph tokenizer much more difficult. Maybe this logic can go into the Lexer so the paragraph tokenizer remains easily editable.

@calculuschild
Contributor Author

calculuschild commented May 8, 2021

@UziTech Is there a way to clear the extensions/options between unit tests?

Edit: Let's see... I've found marked.defaults = marked.getDefaults(); Is that the best way to do it?

Edit2: Bah. For some reason that causes tests in Parser-spec.js to fail in very strange ways even though I'm not touching that file at all. Text from the marked-spec.js is showing up in the results there for some reason? I wonder if the async tests are to blame. Commented out for now...

@calculuschild
Contributor Author

calculuschild commented May 9, 2021

Added an "inline" version of my underline token and the extension works.

I was looking at the header-ids extension but I'm not sure how to best approach that. I see a few options:

  1. Overwrite the Heading renderer (can already do this)
  2. Overwrite the Heading tokenizer (can already do this)
  3. Insert a custom Block HeadingID extension which runs before the normal Heading tokenizer (simple enough with this PR)
  4. Have a custom Inline "ID" extension that we could apply more generally to whatever the parent token is (to add an id to an image, to a codespan, etc.) (not possible with this PR yet)

So this brings up some thoughts:

  • Do we want to change the syntax for overwriting existing renderers/tokenizers to somehow follow this new style? (i.e., an optional parameter overwrite : 'heading' instead of before : 'heading'.)
  • I would be interested in getting something like 4) above to work. There are existing Markdown extensions that add a snippet of text inside another block element to give it a particular style or HTML property, such as image sizing, or Pandoc's fenced divs (an example of reusing the {#id} syntax from the header-ids extension in different block types). It would be neat if that didn't require fully overwriting each block element you want to apply the snippet to. This would probably require passing the previous token or parent token in as a parameter to the custom tokenizer, but I'm not sure if that would be a good thing.
  • Are there any extra parameters that we obviously should pass in that our other tokenizers have needed? PrevChar, etc?

@UziTech
Member

UziTech commented May 10, 2021

Is there a way to clear the extensions/options between unit tests?

You should add extensions: null to the defaults and it should be reset between tests.

@UziTech
Member

UziTech commented May 10, 2021

Do we want to change the syntax for overwriting existing renderers/tokenizers to somehow follow this new style? (i.e., an optional parameter overwrite : 'heading' instead of before : 'heading'.)

Good question. I'm not sure if it is better to have multiple ways to do the same thing. I think it would depend on whether it would slow down marked without extensions too much.

@calculuschild
Contributor Author

it would depend on whether it would slow down marked without extensions too much.

I was imagining just a change in the Marked.js page, to handle a different extension format but we would parse it into the same marked.defaults.renderer or whatever so speed should be similar when executing.

I would probably be in favor of just having one way to do things, and I also like the idea of having consistency between how extensions are formatted.

@UziTech
Member

UziTech commented May 10, 2021

Sounds good. We could add this way of extensions and deprecate the old way and remove it in v3 or v4

@UziTech
Member

UziTech commented May 10, 2021

I would be interested in getting something like 4) above to work.

We could pass the parent token or something if it is not a top level token so the tokenizer or renderer could do different things depending on what type of token it is in.

@calculuschild
Contributor Author

calculuschild commented May 10, 2021

We could add this way of extensions and deprecate the old way and remove it in v3 or v4

Ok, so if we do this and require each extension to be formatted as an object, what about supporting either a single extension object or an array of extensions (rather than a nested object of extensions), so order is preserved for consistency? Like so:

extension1 = {
  name: 'extension1',
  overwrite: 'header',
  level: 'block',
  tokenizer: (src) => { /* ... */ }
};

extension2 = {
  name: 'extension2',
  before: 'paragraph',
  level: 'block',
  tokenizer: (src) => { /* ... */ }
};

marked.use(extension1);
marked.use(extension2);
// EQUIVALENT TO
marked.use([extension1, extension2]);

If an extension has "overwrite", treat it as we currently treat extensions, i.e., merge into marked.defaults.

Otherwise, merge into an extensions map depending on the contents of before.

@calculuschild
Contributor Author

We should definitely only run through the extensions once and create a map like:

@UziTech I applied all the changes discussed above. Processing in marked.js got a bit ugly (and I'm probably missing special cases/error handling, which I am not experienced with) but the code elsewhere is much simplified now without the looping and results in an extensions object of the following format:

extensions {
  [beforeA]: [tokenizer1, tokenizer2],
  [beforeB]: [tokenizer3],
  'last': [tokenizer4],
  [nameA]: renderer1,
  [nameB]: renderer2,
  startBlock: /regex1|regex2|regex3/,
  startInline: /regex4|regex5/
}
Overwriting extensions now require the property overwrite: true, and they will be merged into defaults as before.

Custom extensions with tokenizers are added under the before key 'last' by default and run after all of the default tokenizers, unless the user supplies a valid before property. Inline tokenizers need a valid start value to work properly; block tokenizers only need start if they should be able to interrupt paragraphs, a la tables, headings, etc.

Custom extensions with Renderers must have a name property.

start regexes are now merged into one to avoid looping and to ensure the first match index is chosen.
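A hypothetical sketch of that merging (the helper name and sample patterns are illustrative, not marked's actual code): combining every extension's start pattern into one alternation means a single exec() call yields the earliest index at which any extension could match.

```javascript
// Merge per-extension "start" patterns into one alternation so one exec()
// finds the earliest position where any custom extension might match.
function mergeStartRegexes(sources) {
  return new RegExp(sources.map(s => `(?:${s})`).join('|'));
}

// e.g. an underline extension starting at ':' and a footnote starting at '^['
const startBlock = mergeStartRegexes([':', '\\^\\[']);

const src = 'plain text then :underline:';
const index = startBlock.exec(src)?.index;
console.log(index); // 16
```

Because regex alternation scans left to right through the source, the combined pattern's match index is automatically the minimum over all extensions, with no per-extension loop.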

@calculuschild
Contributor Author

calculuschild commented May 11, 2021

Added current benchmarks to the OP. Aw man... Not sure where the slowdown is coming from yet.

Edit: Ah... Most of it is because I was accidentally comparing to my (old, but faster) fork of the Master branch. I updated the original post comparing the four scenarios.

@calculuschild
Contributor Author

calculuschild commented May 11, 2021

Now that I think about it.... Do we NEED a before parameter? What if we just have all the custom extensions run one after another, very first thing, to cut down on the number of calls we need sprinkled throughout the Lexer? I can't actually think of a case where a user would want to add a custom tokenizer that specifically runs after the others. We must have had at least one example of it before but I can't find it anywhere.

I have this working on my local branch and it actually removes most of the slowdown. If that sounds reasonable I can commit it.

@UziTech
Member

UziTech commented May 12, 2021

Ya that sounds fine. If we need to allow running between other tokenizers we can add that later.

@UziTech
Member

UziTech commented Jun 1, 2021

What do you think about passing in something like a previousToken and nextToken to walkTokens?

I have thought about adding all of those things. The problem is: where do you stop? it could also be helpful to know what the parent token is ... and the grandparent token ... and all of the siblings.

I think the best thing to do is say walkTokens is simply a convenience function for walking each token and altering the tokens just based on that token. If someone wants to do more they can get the tokens and walk them however they want manually.

@calculuschild
Contributor Author

calculuschild commented Jun 1, 2021

they can get the tokens and walk them however they want manually.

How would a user go about doing that? Just accessing the Lexer?

The problem is: where do you stop?

Right... Ok, what if we do this instead: after a quick google, apparently we could get the index of the current token being walked with for [index, token] of tokens.entries(). Then we could pass in index and tokens, which would give access to all siblings, and you know your position within the siblings. Children are already accessible. And parents/grandparents could be accessed by just walking them and checking if they contain the child token of interest, then manipulating that child. I think that gives access to anything in the token tree. You would just have to write your walkTokens function to look sideways or down.
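The entries()-based sibling walk described above can be sketched as follows (standalone, with hypothetical sample tokens; not marked's walkTokens itself):

```javascript
// Walking sibling tokens with their index via Array.prototype.entries(),
// so the walker knows its position among siblings and can look sideways.
const tokens = [
  { type: 'paragraph', text: 'a' },
  { type: 'table', text: 'b' },
  { type: 'paragraph', text: 'c' }
];

for (const [index, token] of tokens.entries()) {
  const prev = tokens[index - 1]; // undefined at the start of the list
  const next = tokens[index + 1]; // undefined at the end of the list
  if (prev?.type === 'paragraph' && next?.type === 'paragraph') {
    token.text += ' (between two paragraphs)'; // only the table matches here
  }
}
```

Passing index and tokens together gives each visit access to every sibling, at the cost of widening the walkTokens callback signature.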

@UziTech
Member

UziTech commented Jun 1, 2021

How would a user go about doing that? Just accessing the Lexer?

yes

parents/grandparents could be accessed by just walking them and checking if they contain the child token of interest

How is that easier than just accessing tokens from the lexer and walking them manually?

If you want extensions to be able to get tokens we should do something like hooks and just have a hook that gives the extension all tokens between lex and parse and let them do whatever they want. Same with a hook for pre and post processing.

For the gfm-heading-id extension it would be nice to be able to reset the slugger in a preprocess hook instead of needing to call it manually.

@calculuschild
Contributor Author

calculuschild commented Jun 1, 2021

How is that easier than just accessing tokens from the lexer and walking them manually?

Because then the extension creator essentially has to re-write his own custom walkTokens logic and break apart the Marked pipeline to manually call the lexer, then run his custom walkTokens, then call the parser, instead of just using marked('my Markdown'). This is a headache if anyone else actually wants to use that extension, instead of just plugging into marked.use() and it "just works".

The preprocessor/postprocessor hooks would make it work with marked.use() so you don't have to break apart the pipeline, but it still requires users to build their own walkTokens from scratch which seems redundant given that we already have walkTokens.

Ok, so another approach: What if inside of walkTokens we inject two properties into the token immediately before calling the extension functions. A previousToken and a parentToken. Then each token can navigate up (token.parentToken.parentToken.parentToken), down (token.tokens.tokens), or backward to previous siblings (token.previousToken.previousToken). You could even get away with just the previousToken. And you just need to handle it like CSS where your selectors can only access the next element or child elements in the list, but in reverse (you can only access previous or child items, since next siblings wouldn't have been given their previousToken yet.)

So if you want to change a token between two paragraphs:

tokens = [
  token1, // { type: 'paragraph', ... }
  token2, // { type: 'table', previousToken: token1, ... }
  token3  // { type: 'paragraph', previousToken: token2, ... }
];

// walkTokens
walkTokens(token) {
  if (token.type === 'paragraph' && token.previousToken?.previousToken?.type === 'paragraph') {
    token.previousToken.text += `I'm between two paragraphs!`;
  }
}

Essentially the only limitation with walkTokens as it is now, is accessing siblings. You can already do everything you need to do with parent/grandparent/child by just starting your function at the parent or grandparent and looking down the children.
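The previousToken idea above can be made concrete with a small standalone sketch. The helper name is hypothetical and it only walks a flat sibling list (it does not recurse into children, which is exactly the limitation raised in the next comment):

```javascript
// Inject previousToken into each token immediately before visiting it,
// so the callback can look backward through earlier siblings.
function walkWithPrevious(tokens, callback) {
  let prev = null;
  for (const token of tokens) {
    token.previousToken = prev;
    callback(token);
    prev = token;
  }
}

const toks = [
  { type: 'paragraph', text: 'one' },
  { type: 'table', text: 'data' },
  { type: 'paragraph', text: 'two' }
];

walkWithPrevious(toks, (token) => {
  // CSS-selector-style in reverse: only previous siblings are visible,
  // since later siblings haven't been given their previousToken yet.
  if (token.type === 'paragraph' &&
      token.previousToken?.previousToken?.type === 'paragraph') {
    token.previousToken.text += " (between two paragraphs)";
  }
});
// toks[1].text is now "data (between two paragraphs)"
```

One caveat with attaching previousToken directly to tokens: it makes the token tree cyclic-ish for serialization (JSON.stringify would need the links stripped), which may be another reason to prefer passing index and siblings as arguments instead.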

@UziTech
Member

UziTech commented Jun 1, 2021

Except down isn't always in tokens which means they would have to re-create the logic of walkTokens anyway. Again I don't see this being easier than walking manually. Also I think this PR has done enough for now. We should get this merged and improve it from there.

@calculuschild
Contributor Author

Also I think this PR has done enough for now.

Fair enough. I'll work on that last change.

Member

@UziTech UziTech left a comment


nice work. 👍 This is going to allow so many new things for marked.

@calculuschild
Contributor Author

@styfle , @joshbruce , @davisjam

Have you guys had a chance to look at this PR? I know it's a hefty one so we could sure use some extra eyes on it!

@joshbruce
Member

I'd be lying if I said I understood it all. I appreciate that performance isn't hit too hard given this is something the community has been wanting directly (or by proxy) for a while now. It seems like at least a couple of members from the community appreciate it as well. So, I'm good.

One thing I'd like to put a pin in is the use of null - just saw it in a new line for this PR - probably something for later and not sure how we could remove the reference...kind of a personal drive I'm starting to get with my coding style.

@UziTech UziTech changed the title Custom Tokenizer/Renderer extensions feat: Custom Tokenizer/Renderer extensions Jun 15, 2021
@UziTech UziTech merged commit 5be9d6d into markedjs:master Jun 15, 2021
github-actions bot pushed a commit that referenced this pull request Jun 15, 2021
# [2.1.0](v2.0.7...v2.1.0) (2021-06-15)

### Features

* Custom Tokenizer/Renderer extensions ([#2043](#2043)) ([5be9d6d](5be9d6d))
@github-actions

🎉 This PR is included in version 2.1.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

return ret;
};
if (ext.renderer) { // Renderer extensions
const prevRenderer = extensions.renderers?.[ext.name];


@UziTech @calculuschild I think this line is causing the error below. Will create a separate issue too.

node_modules/protobufjs/node_modules/marked/src/marked.js:158
const prevRenderer = extensions.renderers?.[ext.name];
                                          ^
SyntaxError: Unexpected token '.'


Issue here: #2107
