
Different behavior for TokenizeChar and TokenizeWords #2864

Closed
Ivanidzo4ka opened this issue Mar 5, 2019 · 2 comments

@Ivanidzo4ka
Contributor

TokenizeChar produces a vector of keys.
TokenizeWords produces a vector of strings.
I have to add MapValueToKey after TokenizeWords in order to apply ProduceNgrams to it.
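
A minimal sketch of the word path being described, assuming the current ML.NET 1.x API names (`TokenizeIntoWords`, `MapValueToKey`, `ProduceNgrams`) and a hypothetical input column named `Text`; the extra `MapValueToKey` step is what converts the string tokens into keys before n-grams can be produced:

```csharp
using Microsoft.ML;

// Sketch only: assumes the Microsoft.ML (1.x) NuGet package and an input
// column named "Text". TokenizeIntoWords emits strings, so MapValueToKey
// is required before ProduceNgrams can consume the tokens.
var mlContext = new MLContext();
var wordPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Tokens"));
```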

@TomFinley
Contributor

TomFinley commented Mar 6, 2019

This is intentional and as designed. The set of possible characters (in the sense of possible values of char) is finite and well defined. This allows character tokenization to forego being a trained transform and skip the step of building dictionaries, hashing, or whatever, which remains a requirement for words. The set of possible words is not finite and easy to enumerate, and therefore we need further processing to turn them into something useful for, say, n-gram processing to take hold of. So, in the process of going from text to terms to keys, with characters we are able to skip a step and get what becomes a far more efficient pipeline; good for us! But we can't skip that step with words.
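
The character path described above can be sketched as follows, again assuming the current ML.NET 1.x API names and a hypothetical `Text` input column; because characters are already keys, n-grams can be produced directly with no `MapValueToKey` (and no dictionary-building pass over the training data):

```csharp
using Microsoft.ML;

// Sketch only: assumes the Microsoft.ML (1.x) NuGet package and an input
// column named "Text". The set of possible char values is finite and known
// up front, so TokenizeIntoCharactersAsKeys emits keys directly and the
// pipeline needs no trained dictionary-building step before ProduceNgrams.
var mlContext = new MLContext();
var charPipeline = mlContext.Transforms.Text.TokenizeIntoCharactersAsKeys("Chars", "Text")
    .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Chars"));
```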

@TomFinley
Contributor

Hi @Ivanidzo4ka, if this is good enough, should we close?

@ghost locked as resolved and limited conversation to collaborators Mar 23, 2022