It is awkward to turn off char-grams with FeaturizeText #2946

Closed
rogancarr opened this issue Mar 13, 2019 · 2 comments

@rogancarr
Contributor

rogancarr commented Mar 13, 2019

FeaturizeText was upgraded to allow specification of n-grams for words and characters. However, it is now awkward to use FeaturizeText without char n-grams: the only way to turn them off is to explicitly set CharFeatureExtractor to null.

This is how to compose a bag-of-words with the current API:

var pipeline = mlContext.Transforms.Text.FeaturizeText(
    "Features",
    new TextFeaturizingEstimator.Options
    {
        KeepPunctuations = false,
        OutputTokens = true,
        // Must be set to null explicitly to turn off char-grams.
        CharFeatureExtractor = null,
        // Unigrams only, i.e. a plain bag-of-words.
        WordFeatureExtractor = new WordBagEstimator.Options { NgramLength = 1 },
        VectorNormalizer = TextFeaturizingEstimator.NormFunction.None
    },
    "SentimentText");

I would expect to be able to do something like

CharFeatureExtractor = new WordBagEstimator.Options { NgramLength = 0},

But this throws an error that Skipgrams is not less-than NgramLength, and Skipgrams must be positive.

Overall, it is a bit awkward and not obvious that you have to manually null an option. Is this the API we want to ship in v1.0?

@rogancarr rogancarr added the API Issues pertaining the friendly API label Mar 13, 2019
@wschin
Member

wschin commented Mar 13, 2019

What is char-level tokenization? My impression is that it is the process that generates ['a', 'b', 'c'] out of "abc". Also, I personally consider ['a', 'b', 'c'] to be 1-grams. Therefore, char-level tokenization is valid only if NgramLength is at least 1 (strictly speaking, exactly 1), and we'd better throw when NgramLength=0. Unfortunately, I don't have another solution to make disabling char-level tokenization easier. @zeahmed, any comment?
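
To make the definition above concrete, here is a minimal sketch in plain C#. It does not use ML.NET, and `CharGramSketch` and `CharNgrams` are hypothetical names invented for illustration, not the library's implementation. It only shows that a length of 1 yields the individual characters and that a length of 0 has no sensible meaning, so throwing is a reasonable response.

```csharp
using System;
using System.Collections.Generic;

public static class CharGramSketch
{
    // Hypothetical helper (not part of ML.NET): returns every contiguous
    // run of `ngramLength` characters found in `text`.
    public static List<string> CharNgrams(string text, int ngramLength)
    {
        if (text == null)
            throw new ArgumentNullException(nameof(text));
        // A 0-gram is not meaningful, so reject it instead of treating it as "off".
        if (ngramLength < 1)
            throw new ArgumentOutOfRangeException(
                nameof(ngramLength), "N-gram length must be at least 1.");

        var grams = new List<string>();
        for (int i = 0; i + ngramLength <= text.Length; i++)
            grams.Add(text.Substring(i, ngramLength));
        return grams;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CharNgrams("abc", 1))); // a, b, c
        Console.WriteLine(string.Join(", ", CharNgrams("abc", 2))); // ab, bc
    }
}
```

Under this reading, "no char-grams" is not a value of the length parameter at all, which fits with the current API expressing it as a null extractor rather than NgramLength = 0.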

@rogancarr rogancarr self-assigned this Mar 13, 2019
@rogancarr
Contributor Author

@wschin Yes, your definition of char-level tokenization is correct.

I'm not sure how to make it easier, unless we add back the flags, which is kludgy.

I think adding a note to the summary will be enough. It will show up in the tooltip.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022