Customize tokenizer via the fork API #264

Open · scripthunter7 wants to merge 7 commits into csstree:master from scripthunter7:feature/253
Conversation

scripthunter7 (Contributor) commented Oct 22, 2023

Closes #253

This is a relatively simple PR that allows advanced library users to plug in a custom tokenizer via the fork API. It doesn't change how the base library works; it only affects forks, making them even more flexible.

The custom tokenizer function can be a completely new tokenizer or a thin wrapper around CSSTree's tokenizer; either way, it should meet the following requirements:

  1. It should be compatible with the following signature:

     ```ts
     /**
      * CSSTree's tokenizer signature
      *
      * @param source CSS source code to tokenize
      * @param onToken Callback which will be invoked when a token is found
      */
     function tokenize(source: string, onToken: (tokenType: number, startOffset: number, endOffset: number) => void): void;
     ```

  2. It should use the same token IDs as CSSTree does: https://github.com/csstree/csstree/blob/master/lib/tokenizer/types.js (this is quite natural behaviour, since these tokens are defined by the official specs).

Example usage:

```js
import * as cssTree from 'css-tree';

const customTokenize = function(source, onToken) {
    // ...
};

const cssTreeFork = cssTree.fork({
    // Use the customized tokenizer
    tokenize: customTokenize,
});
```
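As a sketch of the expected callback contract (this is not CSSTree's actual tokenizer; the token-type constants and the splitting logic below are simplified placeholders for illustration), a conforming tokenizer simply walks the source and reports each token through `onToken`:

```javascript
// Simplified placeholder token types; real code should use the IDs from
// css-tree's lib/tokenizer/types.js instead.
const WhiteSpace = 1;
const Ident = 2;

// A toy tokenizer matching the required signature: it scans the source and
// invokes onToken(type, startOffset, endOffset) for every token found.
function toyTokenize(source, onToken) {
    let offset = 0;
    while (offset < source.length) {
        const start = offset;
        const isSpace = /\s/.test(source[offset]);
        // Consume a maximal run of the same character class
        while (offset < source.length && /\s/.test(source[offset]) === isSpace) {
            offset++;
        }
        onToken(isSpace ? WhiteSpace : Ident, start, offset);
    }
}

// Collect tokens to inspect the callback protocol
const tokens = [];
toyTokenize('a b', (type, start, end) => tokens.push([type, start, end]));
// tokens is now [[2, 0, 1], [1, 1, 2], [2, 2, 3]]
```

Any function with this shape, however it produces its tokens internally, can be passed as `tokenize` to the fork config.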

@lahmatiy I think it would be worth writing documentation for the fork API. If you agree, I'd be happy to help put together a basic version in a separate PR when I have some free time. The requirements for the custom tokenizer should also be described there.

coveralls commented Oct 22, 2023

Coverage Status

coverage: 98.869% (+0.003%) from 98.866% when pulling 4659880 on scripthunter7:feature/253 into ba6dfd8 on csstree:master.

lahmatiy (Member) commented

@scripthunter7 Thank you for the PR! I'm supportive of this extension. However, there are a few points to address:

  1. If we're introducing a custom tokenizer, it should be used consistently throughout the entire functionality. It currently seems to be missing in the following modules (and possibly others):
     - `tokenize(str, (type, start, end) =>`
     - `tokenize(chunk, (type, start, end) => {`
  2. We would need unit tests to ensure the correct functioning of this feature.

Regarding documentation for the fork API, I absolutely agree. I would appreciate it if you could propose a PR to lay the groundwork. This would also be a good place to detail the requirements for the custom tokenizer, as you've outlined.

scripthunter7 marked this pull request as draft October 23, 2023 20:49
scripthunter7 marked this pull request as ready for review October 24, 2023 13:25
scripthunter7 (Contributor, Author) commented
@lahmatiy Thank you for your feedback! I believe I've made the tokenizer switchable everywhere now. I also introduced a separate utility that returns the tokenizer function from the config object when one is present, and added some simple unit tests to verify that the custom tokenizer is used in the fork without affecting the default library's behaviour.
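Such a utility might look roughly like this (the name `getTokenizer` and the config shape are assumptions for illustration, not the actual code from this PR):

```javascript
// Hypothetical helper: pick the custom tokenizer from a fork config,
// falling back to the default tokenizer when none is provided.
function getTokenizer(config, defaultTokenize) {
    if (config && typeof config.tokenize === 'function') {
        return config.tokenize;
    }
    return defaultTokenize;
}

// Stand-in tokenizers; real ones would implement the (source, onToken) contract
const defaultTokenize = (source, onToken) => { /* ... default logic ... */ };
const customTokenize = (source, onToken) => { /* ... custom logic ... */ };

const withCustom = getTokenizer({ tokenize: customTokenize }, defaultTokenize);
const withDefault = getTokenizer({}, defaultTokenize);
// withCustom === customTokenize; withDefault === defaultTokenize
```

Centralizing the lookup this way means every module resolves the tokenizer through one code path, which addresses the consistency concern raised above.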

Successfully merging this pull request may close these issues.

Customizable tokenizer in the fork API