Turns a text like "Mary had a little lamb." into an array of tokens:
[
{ position: [ 0, 4 ], index: 0, type: 'word', value: 'Mary' },
{ position: [ 4, 5 ], index: 1, type: 'space', value: ' ' },
{ position: [ 5, 8 ], index: 2, type: 'word', value: 'had' },
{ position: [ 8, 9 ], index: 3, type: 'space', value: ' ' },
{ position: [ 9, 10 ], index: 4, type: 'word', value: 'a' },
{ position: [ 10, 11 ], index: 5, type: 'space', value: ' ' },
{ position: [ 11, 17 ], index: 6, type: 'word', value: 'little' },
{ position: [ 17, 18 ], index: 7, type: 'space', value: ' ' },
{ position: [ 18, 22 ], index: 8, type: 'word', value: 'lamb' },
{ position: [ 22, 23 ], index: 9, type: 'punctuation', value: '.' }
]
npm install positional-tokenizer --save
Positional tokenizer comes preconfigured to tokenize words, spaces, punctuation, numbers and symbols.
import {Tokenizer, Token} from 'positional-tokenizer';
const text = "Mary had a little lamb.";
const tokenizer = new Tokenizer();
const tokens: Token[] = tokenizer.tokenize(text);
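Run against the example sentence, the default configuration produces the token array shown at the top of this page.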
import {Tokenizer, TokenizeSeparator, TokenizeLetter} from 'positional-tokenizer';
// Define tokenization rules to capture words and spaces only
const rules = [
// Tokenize a single occurrence of a separator as spaaace
Tokenizer.ruleMono({spaaace: TokenizeSeparator.ALL}),
// Tokenize a consecutive sequence of letters as wooord
Tokenizer.ruleMulti({wooord: TokenizeLetter.ALL})
];
// Pass the rules to the tokenizer constructor
const tokenizer = new Tokenizer(rules);
Tokenizer exposes the static methods .ruleMono() and .ruleMulti() to compose the rules.
Tokenizer.ruleMono() will capture a single occurrence of a token type.
Tokenizer.ruleMulti() will capture a group of consecutive occurrences of a token type.
Both methods accept a key-value pair of a token type and the tokenization pattern to apply; the difference is sketched below.
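As a rough illustration of the mono/multi difference (the token counts in the comments are an assumption based on the semantics described above, not output copied from the library):
import {Tokenizer, Token, TokenizeNumber} from 'positional-tokenizer';
const digits = "2023";
// ruleMulti groups the consecutive digits into one token
const multi = new Tokenizer([Tokenizer.ruleMulti({number: TokenizeNumber.ALL})]);
const multiTokens: Token[] = multi.tokenize(digits); // expected: 1 token with value '2023'
// ruleMono captures each digit as its own token
const mono = new Tokenizer([Tokenizer.ruleMono({number: TokenizeNumber.ALL})]);
const monoTokens: Token[] = mono.tokenize(digits); // expected: 4 tokens: '2', '0', '2', '3'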
A rule can be described with the following interface:
type TokenizerRule = Record<TokenType, KnownRegexPatterns | RegExp>
type KnownRegexPatterns =
TokenizeLetter |
TokenizeMark |
TokenizeSeparator |
TokenizeSymbol |
TokenizeNumber |
TokenizePunctuation |
TokenizeOther |
TokenizeWord;
Each category comes with a set of predefined regex patterns. The categories listed here are implemented with the corresponding Unicode character sequences, and each exposes an .ALL property for capturing all of them.
TokenizeWord.SIMPLE matches a sequence of letters only (identical to TokenizeLetter.ALL), capturing words like I'm, don't and devil-grass as three tokens each.
TokenizeWord.COMPLEX matches a sequence of letters, dashes and apostrophes, capturing words like I'm, don't and devil-grass as a single token.
// somewhere inside the tokenizer code
const DEFAULT_RULES = [
Tokenizer.ruleMulti({ word: TokenizeLetter.ALL }),
Tokenizer.ruleMono({ space: TokenizeSeparator.ALL }),
Tokenizer.ruleMono({ punctuation: TokenizePunctuation.ALL }),
Tokenizer.ruleMulti({ number: TokenizeNumber.ALL }),
Tokenizer.ruleMulti({ symbol: TokenizeSymbol.ALL })
];
You may also compose rules using regular expressions of your choice.
import {Tokenizer, Token} from 'positional-tokenizer';
const text = "Des Teufels liebstes Möbelstück ist die lange Bank.";
const tokenizer = new Tokenizer([
Tokenizer.ruleMono({period: new RegExp('\\.')}),
Tokenizer.ruleMono({umlaut: new RegExp('[öüä]')}),
]);
const tokens: Token[] = tokenizer.tokenize(text);
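With these rules only the characters matched by a rule become tokens, so for the sentence above you would expect umlaut tokens for the 'ö' and 'ü' in Möbelstück plus a period token at the end (three tokens in total, assuming characters not covered by any rule are simply skipped, as in the word-and-space example below).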
import {Tokenizer, Token, TokenizeSeparator, TokenizeLetter} from 'positional-tokenizer';
const text = "Mary had a little lamb.";
const tokenizer = new Tokenizer([
Tokenizer.ruleMulti({word: TokenizeLetter.ALL}),
Tokenizer.ruleMono({space: TokenizeSeparator.ALL})
]);
const tokens: Token[] = tokenizer.tokenize(text);
import {Tokenizer, Token, TokenizeNumber} from 'positional-tokenizer';
const text = "Mary had 12 little lambs.";
const tokenizer = new Tokenizer([
Tokenizer.ruleMulti({number: TokenizeNumber.ALL})
]);
const tokens: Token[] = tokenizer.tokenize(text);
import {Tokenizer, Token, TokenizeSeparator, TokenizeWord, TokenizePunctuation} from 'positional-tokenizer';
const text = "Mary's had a little-beetle.";
const tokenizer = new Tokenizer([
Tokenizer.ruleMulti({word: TokenizeWord.SIMPLE}),
Tokenizer.ruleMono({space: TokenizeSeparator.ALL}),
Tokenizer.ruleMono({punct: TokenizePunctuation.ALL})
]);
const tokens: Token[] = tokenizer.tokenize(text); // tokens.length === 12
import {Tokenizer, Token, TokenizeSeparator, TokenizeWord, TokenizePunctuation} from 'positional-tokenizer';
const text = "Mary's had a little-beetle.";
const tokenizer = new Tokenizer([
Tokenizer.ruleMulti({word: TokenizeWord.COMPLEX}),
Tokenizer.ruleMono({space: TokenizeSeparator.ALL}),
Tokenizer.ruleMono({punct: TokenizePunctuation.ALL})
]);
const tokens: Token[] = tokenizer.tokenize(text); // tokens.length === 8
new Tokenizer(rules?): creates a new instance of the tokenizer with optional rules.
tokenizer.tokenize(text): tokenizes a text into an array of tokens.
token.position: position of the token in the text.
token.index: index of the token in the text.
token.type: type of the token.
token.value: value of the token.
token.toString(): returns a string representation of the token.
token.toJSON(): returns a JSON representation of the token.
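For completeness, a minimal sketch of reading these members off a token; the property names mirror the example output at the top, and JSON.stringify is used here only to exercise the token's JSON representation:
import {Tokenizer, Token} from 'positional-tokenizer';
const text = "Mary had a little lamb.";
const tokens: Token[] = new Tokenizer().tokenize(text);
const first: Token = tokens[0];
console.log(first.position); // [ 0, 4 ]
console.log(first.index); // 0
console.log(first.type); // 'word'
console.log(first.value); // 'Mary'
// Serialization picks up the token's JSON representation
console.log(JSON.stringify(first));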