
Different behavior for TokenizeChar and TokenizeWords #2864

Closed
Ivanidzo4ka opened this issue Mar 5, 2019 · 2 comments

@Ivanidzo4ka
Contributor

TokenizeChar produces a vector of keys.
TokenizeWords produces a vector of strings.
I have to add MapValueToKey after TokenizeWords in order to apply ProduceNgrams to it.
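
A minimal sketch of the word path being described, assuming the current ML.NET 1.x API names (`TokenizeIntoWords`, `MapValueToKey`, `ProduceNgrams`) and a hypothetical input column named `Text`; the extra `MapValueToKey` step is what converts the string tokens into keys before n-grams can be produced:

```csharp
using Microsoft.ML;

// Sketch only: assumes the Microsoft.ML (1.x) NuGet package and an input
// column named "Text". TokenizeIntoWords emits strings, so MapValueToKey
// is required before ProduceNgrams can consume the tokens.
var mlContext = new MLContext();
var wordPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Tokens"));
```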

@TomFinley
Contributor

TomFinley commented Mar 6, 2019

This is intentional and as designed. The set of possible characters (in the sense of possible values of char) is finite and well defined. This allows character tokenization to forego being a trained transform and skip the step of building dictionaries, hashing, or whatever, which remains a requirement for words. The set of possible words is not finite and easy to enumerate, and therefore we need further processing to turn them into something useful for, say, n-gram processing to take hold of. So, in the process of going from text to terms to keys, with characters we are able to skip a step and get what becomes a far more efficient pipeline; good for us! But we can't skip that step with words.
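
The character path described above can be sketched as follows, again assuming the current ML.NET 1.x API names and a hypothetical `Text` input column; because characters are already keys, n-grams can be produced directly with no `MapValueToKey` (and no dictionary-building pass over the training data):

```csharp
using Microsoft.ML;

// Sketch only: assumes the Microsoft.ML (1.x) NuGet package and an input
// column named "Text". The set of possible char values is finite and known
// up front, so TokenizeIntoCharactersAsKeys emits keys directly and the
// pipeline needs no trained dictionary-building step before ProduceNgrams.
var mlContext = new MLContext();
var charPipeline = mlContext.Transforms.Text.TokenizeIntoCharactersAsKeys("Chars", "Text")
    .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Chars"));
```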

@TomFinley
Contributor

Hi @Ivanidzo4ka, if this is good enough, should we close?

@ghost locked as resolved and limited conversation to collaborators Mar 23, 2022