
FeaturizeText outputTokens uses a magical string to name a new column #2957

Closed
rogancarr opened this Issue Mar 14, 2019 · 16 comments

Comments

5 participants
rogancarr (Contributor) commented Mar 14, 2019

When using OutputTokens=true, FeaturizeText creates a new column called ${OutputColumnName}_TransformedText. This isn't really well documented anywhere, and it's odd behavior. I suggest that we make the tokenized text column name explicit in the API.

My suggestion would be the following:

  • Change OutputTokens = [bool] to OutputTokensColumn = [string], where string.IsNullOrWhiteSpace(OutputTokensColumn) signifies that the column will not be created.

What do you all think?
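A rough sketch of how the current and proposed options would compare in user code (a sketch only; the option type and property names here are assumptions, not a final API):

```csharp
// Current behavior: a bool flag, and the tokens land in a column whose
// name is derived by convention: "${OutputColumnName}_TransformedText".
var current = mlContext.Transforms.Text.FeaturizeText("Features",
    new TextFeaturizingEstimator.Options { OutputTokens = true },
    "SentimentText");
// -> produces "Features" plus a magically named "Features_TransformedText"

// Proposed behavior: the caller names the tokens column explicitly;
// a null/whitespace name means no tokens column is created.
var proposed = mlContext.Transforms.Text.FeaturizeText("Features",
    new TextFeaturizingEstimator.Options { OutputTokensColumn = "FeaturesTokens" },
    "SentimentText");
```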

@rogancarr rogancarr self-assigned this Mar 14, 2019

@rogancarr rogancarr added the api label Mar 14, 2019

@rogancarr rogancarr removed their assignment Mar 14, 2019

rogancarr (Contributor, Author) commented Mar 14, 2019 (comment minimized)

Ivanidzo4ka (Member) commented Mar 14, 2019

I would vote to hide that option in general and switch our examples to a chain of estimators, like we do here:

var wordsPipeline = ml.Transforms.Text.NormalizeText("NormalizedText", "SentimentText", keepDiacritics: false, keepPunctuations: false)
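Extended into a full tokenizing chain, that snippet might look like the following (a sketch assuming the standard TextCatalog extension methods; the exact steps and column names vary by scenario):

```csharp
// Normalize, tokenize, and remove stop words as separate, explicit steps.
var wordsPipeline = ml.Transforms.Text.NormalizeText("NormalizedText", "SentimentText",
        keepDiacritics: false, keepPunctuations: false)
    .Append(ml.Transforms.Text.TokenizeIntoWords("Words", "NormalizedText"))
    .Append(ml.Transforms.Text.RemoveDefaultStopWords("CleanWords", "Words"));
```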

rogancarr (Contributor, Author) commented Mar 14, 2019

@Ivanidzo4ka I agree with what you're saying, but FeaturizeText is a "grab-bag" transform that does a bunch of stuff in one go, so unless we drop it from the APIs, we should fix this directly.

Update: I see what you're saying. You are saying to keep the transform, but turn off the "introspection" pieces.

justinormont (Member) commented Mar 14, 2019

Exploding the text transform into its subparts will make the WordEmbedding quite error-prone to use. For instance, users will often forget to lowercase their text, causing a silent degradation since it will still semi-work. I strongly prefer the explorability and simplicity of using a single transform instead of the seven or so sub-components.

On the original topic of creating an explicit tokens column:
Does the API allow for multiple columns to be handled independently? E.g., Title => NgramsTitle and Description => NgramsDescription? If so, we would also produce NgramsTitle_TransformedText & NgramsDescription_TransformedText; reducing to a single tokens column would mean we can only handle one input column.

Ivanidzo4ka (Member) commented Mar 14, 2019

So we have this chunky transform which does a lot of stuff and has plenty of knobs. One of these knobs is OutputTokens. It involves half of the other knobs, and is independent from the extractor and normalization parts.

I see the main purpose of the FeaturizeText transform as, you know, taking text and producing something we can consume in trainers, which is a vector of floats. And I like it when a transform, or a function in general, does one thing.

OutputTokens feels to me like a dangling piece of functionality which we added because we could, not because we should.

I understand it can be error-prone to convert text into tokens, but nothing stops us from introducing another transform which would do that and only that (and inside would just be the same chain of already existing transformers), rather than having one transform which does two things.

To answer @justinormont's question regarding current behavior for multiple columns: we just concat them together into one column and process it as one column.

justinormont (Member) commented Mar 14, 2019

Another reason it's nice for FeaturizeText to produce the tokens: the winning combo is generally keeping both the ngrams and the word embedding, not just the word embedding by itself.

abgoswam (Member) commented Mar 15, 2019

As suggested by @Ivanidzo4ka, if users want to extract tokens as part of the pipeline, we can point them to the TokenizeIntoWords estimator (an API designed specifically for that task):

/// <summary>
/// Tokenizes incoming text in <paramref name="inputColumnName"/>, using <paramref name="separators"/> as separators,
/// and outputs the tokens as <paramref name="outputColumnName"/>.
/// </summary>
/// <param name="catalog">The text-related transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="separators">The separators to use (uses space character by default).</param>
public static WordTokenizingEstimator TokenizeIntoWords(this TransformsCatalog.TextTransforms catalog,
    string outputColumnName,
    string inputColumnName = null,
    char[] separators = null)
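A minimal usage sketch of the estimator above (column names here are illustrative):

```csharp
// Tokenize the "SentimentText" column into a vector of words in "Tokens".
var tokenizer = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "SentimentText");

// Custom separators can be supplied, e.g. splitting on spaces and commas:
var customTokenizer = mlContext.Transforms.Text.TokenizeIntoWords(
    "Tokens", "SentimentText", separators: new[] { ' ', ',' });
```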

For V1.0 I feel it makes sense to hide the OutputTokens option from the FeaturizeText API.

P.S. @rogancarr @Ivanidzo4ka @justinormont let me know if we have consensus on this -- I can get going on this issue.

@abgoswam abgoswam self-assigned this Mar 15, 2019

rogancarr (Contributor, Author) commented Mar 15, 2019

If we add a separate step for outputting tokens, then that is a completely separate pass through the data to create tokens that we already have. I would recommend keeping the output option, but letting the user specify a name.

abgoswam (Member) commented Mar 15, 2019

In other places of the API we have moved away from adding "smarts" (no auto-normalization, no auto-calibration). Should we adhere to this motto for FeaturizeText as well?

If we expose OutputTokens in V1.0, we are fixing the behavior of FeaturizeText to always tokenize, without giving us the flexibility to modify this behavior in the future.

@TomFinley

justinormont (Member) commented Mar 15, 2019

@abgoswam: in addition, there are various more steps needed than just TokenizeIntoWords to get the tokens into the form the pretrained word embeddings need (diacritics, lowercasing, sometimes stop words).

rogancarr (Contributor, Author) commented Mar 15, 2019

So is the concern that we may want FeaturizeText to not tokenize in future releases?

abgoswam (Member) commented Mar 15, 2019

@rogancarr: Yep, that's my primary concern. Specifically, if a user uses TokenizeIntoWords followed by FeaturizeText, do we promise to tokenize it again in FeaturizeText?

  • If yes, then we can keep the OutputTokens option inside FeaturizeText.
  • If no, then we should hide this option for V1.0.

@justinormont: I'm not following you.

  • "...to the form the pretrained word embeddings need" .. which pretrained embeddings are you referring to, word2vec or GloVe? I am not seeing how they relate to the FeaturizeText API.
justinormont (Member) commented Mar 15, 2019

@abgoswam: Ahh, OK.
I know of two uses of the <colname>_TransformedText output:

  1. The WordEmbeddings estimator takes in the munged/cleaned tokens from FeaturizeText as its input (see: example)
  2. Debugging: seeing how FeaturizeText is munging/cleaning the input text
abgoswam (Member) commented Mar 15, 2019

@justinormont Thanks for the clarification.

I created a small sample for this. Kindly take a look at example #2985.

  1. It uses ApplyWordEmbedding as a step in the pipeline to get SentimentEmbeddingFeatures. Note that having <colname>_TransformedText is not a requirement.

  2. FeaturizeText is doing much more than it is supposed to do. Hiding OutputTokens gives us a chance to fix some of its behavior in the future. The same holds true for the other "smarts" it applies behind the scenes (e.g., lowercasing).

  3. The example in #2985 also highlights the scenario described above in #2957 (comment).

justinormont (Member) commented Mar 16, 2019

@abgoswam: Your example #2985 rather misses the reason you want the FeaturizeText before the ApplyWordEmbedding.

The word embedding input needs to run through the steps in FeaturizeText: you need to match the pre-trained word embeddings. For instance, you have to match the casing of the fastText (and other) models or your word lookup will fail.

All of our included pre-trained word embeddings, for instance, are lowercase. There is no "Cat" in the model, but there is a "cat". Beyond the hard failure of casing, for most pre-trained word embeddings, having FeaturizeText remove punctuation, diacritics, and stop words will lead to gains.

abgoswam (Member) commented Mar 16, 2019

One of the invariants of the FeaturizeText API is that it works on tokens, and this is irrespective of the "smarts" built into the API or the presence of another tokenizing API, TokenizeIntoWords, in ML.NET.

As such, the recommendation by @rogancarr above in #2957 (comment) seems reasonable.

Here is the proposal:

  • Rename OutputTokens to OutputTokensColumnName.
  • OutputTokensColumnName is of type string. We let users specify the name of the column.
    • If null or empty, we do not create the tokens column.
      • This behavior would be similar to setting OutputTokens=false in existing code.
    • Otherwise, create the tokens column.
      • This behavior would be similar to setting OutputTokens=true in existing code, except that we will name the column OutputTokensColumnName instead of creating a column with the magic string (*_TransformedText) as we do currently.
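Under this proposal, user code would look roughly like the following (a sketch; the options type name is assumed from the existing API surface):

```csharp
// Tokens column requested explicitly by name: no magic "*_TransformedText".
var withTokens = mlContext.Transforms.Text.FeaturizeText("Features",
    new TextFeaturizingEstimator.Options { OutputTokensColumnName = "FeaturesTokens" },
    "SentimentText");

// No tokens column: leave OutputTokensColumnName null (the default),
// matching OutputTokens=false in the old API.
var withoutTokens = mlContext.Transforms.Text.FeaturizeText("Features",
    new TextFeaturizingEstimator.Options(),
    "SentimentText");
```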

@abgoswam abgoswam added this to In progress in Project 13 Mar 18, 2019

@shauheen shauheen added this to the 0319 milestone Mar 18, 2019

@shauheen shauheen added this to In progress in v0.12 Mar 18, 2019

Project 13 automation moved this from In progress to Done Mar 19, 2019

v0.12 automation moved this from In progress to Done Mar 19, 2019
