It is awkward to turn off char-grams with FeaturizeText #2946

Closed
rogancarr opened this Issue Mar 13, 2019 · 2 comments

Contributor

rogancarr commented Mar 13, 2019

FeaturizeText was upgraded to allow specification of n-grams for words and characters. However, it is now awkward to use FeaturizeText without specifying n-grams: CharFeatureExtractor must be explicitly set to null.

This is how to compose a bag-of-words with the current API:

var pipeline = mlContext.Transforms.Text.FeaturizeText(
    "Features",
    new TextFeaturizingEstimator.Options
    {
        KeepPunctuations = false,
        OutputTokens = true,
        CharFeatureExtractor = null,
        WordFeatureExtractor = new WordBagEstimator.Options { NgramLength = 1},
        VectorNormalizer = TextFeaturizingEstimator.NormFunction.None
    },
    "SentimentText");

I would expect to be able to do something like

CharFeatureExtractor = new WordBagEstimator.Options { NgramLength = 0},

But this throws an error saying that Skipgrams is not less than NgramLength and that Skipgrams must be positive.

Overall, it is a bit awkward and not obvious that you have to manually null an option. Is this the API we want to ship in v1.0?

@rogancarr rogancarr added the api label Mar 13, 2019

Member

wschin commented Mar 13, 2019

What is char-level tokenization? My impression is that it is the process of generating ['a', 'b', 'c'] from "abc". I would also consider ['a', 'b', 'c'] to be 1-grams. Therefore, char-level tokenization is valid only if NgramLength is at least 1 (strictly speaking, exactly 1), and we had better throw when NgramLength=0. Unfortunately, I don't have another solution that makes disabling char-level tokenization easier. @zeahmed, any comments?
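
To make that definition concrete, here is a minimal sketch in plain C#. It is not ML.NET code: the class name CharNgramSketch and the method Extract are hypothetical and exist only for illustration. It shows why a length of 1 yields ['a', 'b', 'c'] for "abc", and why a length of 0 is better rejected than read as "disabled".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; this is not part of ML.NET.
public static class CharNgramSketch
{
    // Returns every contiguous run of n characters in text, in order.
    public static List<string> Extract(string text, int n)
    {
        if (text == null)
            throw new ArgumentNullException(nameof(text));

        // A 0-length n-gram has no meaning, so reject it instead of
        // silently treating it as "char-grams turned off".
        if (n < 1)
            throw new ArgumentOutOfRangeException(nameof(n), "n must be at least 1.");

        var grams = new List<string>();
        for (int i = 0; i + n <= text.Length; i++)
            grams.Add(text.Substring(i, n));
        return grams;
    }
}
```

Under these assumptions, CharNgramSketch.Extract("abc", 1) gives the three single characters, CharNgramSketch.Extract("abc", 2) gives "ab" and "bc", and a length of 0 throws. Turning the extractor off therefore has to be expressed some other way, such as the null shown in the issue description.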

@rogancarr rogancarr self-assigned this Mar 13, 2019

Contributor Author

rogancarr commented Mar 13, 2019

@wschin Yes, your definition of char-level tokenization is correct.

I'm not sure how to make it easier, unless we add back the flags, which is kludgy.

I think adding a note to the summary will be enough. It will show up in the tooltip.
