It is awkward to turn off char-grams with FeaturizeText #2946

rogancarr · 2019-03-13T20:44:38Z

FeaturizeText was upgraded to allow specification of n-grams for words and characters. However, it is now awkward to use FeaturizeText without specifying n-grams. It is now necessary to explicitly set CharFeatureExtractor to null.

This is how to compose a bag-of-words with the current API:

var pipeline = mlContext.Transforms.Text.FeaturizeText(
    "Features",
    new TextFeaturizingEstimator.Options
    {
        KeepPunctuations = false,
        OutputTokens = true,
        CharFeatureExtractor = null,
        WordFeatureExtractor = new WordBagEstimator.Options { NgramLength = 1},
        VectorNormalizer = TextFeaturizingEstimator.NormFunction.None
    },
    "SentimentText");

I would expect to be able to do something like

CharFeatureExtractor = new WordBagEstimator.Options { NgramLength = 0},

But this throws an error that Skipgrams is not less-than NgramLength, and Skipgrams must be positive.

Overall, it is a bit awkward and not obvious that you have to manually null an option. Is this the API we want to ship in v1.0?


wschin · 2019-03-13T21:12:04Z

What is char-level tokenization? My impression is that it is the process of generating ['a', 'b', 'c'] out of "abc". Also, I personally consider ['a', 'b', 'c'] to be 1-grams. Therefore, char-level tokenization is valid only if NgramLength is greater than 0 (precisely, equal to 1), and we'd better throw when NgramLength=0. Unfortunately, I don't have another solution to make disabling char-level tokenization easier. @zeahmed, any comment?

rogancarr · 2019-03-13T21:59:15Z

@wschin Yes, your definition of char-level tokenization is correct.
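As a rough illustration of that definition, here is a minimal sketch in plain C#. It is not ML.NET's implementation; `CharNgramSketch` and `CharNgrams` are hypothetical names made up for this example. It only shows that "abc" yields ['a', 'b', 'c'] as 1-grams and that a length of 0 has no sensible meaning, so the sketch throws in that case.

```csharp
using System;
using System.Collections.Generic;

public static class CharNgramSketch
{
    // Returns every contiguous run of n characters in text, in order.
    // Hypothetical helper for illustration only; not part of ML.NET.
    public static List<string> CharNgrams(string text, int n)
    {
        // A 0-length gram is meaningless, so reject it up front.
        if (n < 1)
            throw new ArgumentOutOfRangeException(nameof(n), "n must be at least 1.");

        var grams = new List<string>();
        for (int i = 0; i + n <= text.Length; i++)
            grams.Add(text.Substring(i, n));
        return grams;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CharNgrams("abc", 1))); // a, b, c
        Console.WriteLine(string.Join(", ", CharNgrams("abc", 2))); // ab, bc
    }
}
```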

I'm not sure how to make it easier, unless we add back the flags, which is kludgy.

I think adding a note to the summary will be enough. It will show up in the tooltip.
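For reference, the mirror case (char-grams only, word-grams off) might look like the sketch below. It reuses the option names from the snippet at the top of this issue and assumes that setting `WordFeatureExtractor` to null disables word-gram generation the same way, which the title of #2947 suggests. The class name and the `NgramLength = 3` value are arbitrary choices for illustration, and the option names may differ in other ML.NET versions.

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class CharGramsOnlySketch
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Assumption: nulling WordFeatureExtractor turns off word-grams,
        // mirroring how CharFeatureExtractor = null turns off char-grams.
        var pipeline = mlContext.Transforms.Text.FeaturizeText(
            "Features",
            new TextFeaturizingEstimator.Options
            {
                WordFeatureExtractor = null,
                // Illustrative value: character trigrams.
                CharFeatureExtractor = new WordBagEstimator.Options { NgramLength = 3 },
            },
            "SentimentText");
    }
}
```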

rogancarr added the API Issues pertaining the friendly API label Mar 13, 2019

rogancarr self-assigned this Mar 13, 2019

rogancarr mentioned this issue Mar 13, 2019

FeaturizeText: Add instructions to turn off char- or word-gram generation to the tooltip. #2947

Merged

rogancarr closed this as completed in #2947 Mar 13, 2019

ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
