
Question: Handling Comma Separated List Of Ids Per Row #4487

Closed
mayoatte opened this issue Nov 19, 2019 · 9 comments
@mayoatte commented Nov 19, 2019

System information

  • OS version/distro: Windows 10
  • .NET Version (e.g., dotnet --info): 3.0.0

Issue

I'm trying to use the FieldAwareFactorizationMachine for classification. One of my most important features is a comma-separated list of IDs, e.g. "3499430, 3499435, 34995430" (the IDs of items in a shopping cart), passed in as a single column with each row in the dataset.

I'm struggling with the right transformations to use on this column so that the feature can have the proper effect during training. So far I've only been able to use a OneHotHashEncoding, but I'm not sure if that's right. It seems like I should be splitting up the list and converting the values to keys and then to vectors, but the resulting vectors do not have fixed sizes (which FFM requires).

The general idea is to classify if some other item goes with this shopping cart.

Can you please help with some ideas on how to proceed?

@ashbhandare (Contributor) commented Nov 19, 2019

@mayoatte Thank you for your question. You would definitely need to split the IDs. Do you mean that each row might have a different number of item IDs, resulting in variable-sized vectors?

Do you have a fixed vocabulary of items? If so, you could consider converting the features into a sort of dictionary, with each field holding the count of items of that type.

@mayoatte (Author) commented Nov 20, 2019

@ashbhandare Thank you for the fast response.

Yes, each row has a different number of item IDs, which results in the variable-sized vectors you mentioned (the FFM algorithm does not like that).

The vocabulary of item IDs is pretty large (tens of thousands) since it is made of every item ID that has ever been ordered.

I can split the row into an integer array before starting the transformer pipeline (e.g. with a readonly property on the model), but I'm not sure what to do with the array after that. The transformers don't appear to accept arrays as input, so I can't figure out how to manipulate the column to even try your suggestion.
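For illustration, the readonly-property idea could look like the sketch below. The class, column, and property names are hypothetical (not from this thread), and the splitting happens on the model before any transform runs:

```csharp
using System;
using System.Linq;
using Microsoft.ML.Data;

// Hypothetical input model. "CartItemIds" is the raw comma-separated string
// loaded from the data file; "CartItems" is a computed, read-only view of it.
public class CartRow
{
    [LoadColumn(0)]
    public string CartItemIds { get; set; }

    [NoColumn] // not loaded from the file; derived from CartItemIds
    public string[] CartItems =>
        CartItemIds == null
            ? Array.Empty<string>()
            : CartItemIds.Split(',').Select(s => s.Trim()).ToArray();
}
```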

Thoughts?

@justinormont (Member) commented Nov 20, 2019

You can use the text processing transforms to produce a bag-of-words with unigrams and a separator of ",". This creates a feature for each distinct value of your ItemIDs.

The feature will be boolean (0.0/1.0) if an ItemID doesn't repeat within a row, and a count (0.0, 1.0, 2.0, ...) otherwise.

You can manually do the steps of the FeaturizeText:

var pipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "ItemIDs", separators: new[] { ',' })
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("FeaturesText", "Tokens"));

You may also want to append a NormalizeLpNorm() which downscales the weight of each ItemID when there are many ItemIDs.
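Putting those steps together with the normalization, a sketch of the full chain might look like this (column names are illustrative, and the pipeline is untested):

```csharp
// Tokenize on commas, map tokens to keys, count unigrams,
// then L2-normalize so large carts don't dominate.
var pipeline = mlContext.Transforms.Text.TokenizeIntoWords(
        "Tokens", "ItemIDs", separators: new[] { ',' })
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("FeaturesText", "Tokens"))
    .Append(mlContext.Transforms.NormalizeLpNorm("FeaturesText"));
```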


Ideally, the options for FeaturizeText would expose its separators field in its options:

var options = new TextFeaturizingEstimator.Options()
{
    WordFeatureExtractor = new WordBagEstimator.Options() { NgramLength = 1, Separators = new[] { ',' } },
    CharFeatureExtractor = null
};

var pipeline = mlContext.Transforms.Text.FeaturizeText("FeaturesText", options, "ItemIDs");

Manually using the individual transforms has the usability concern that users won't know what each one outputs or which ones to chain together to accomplish otherwise easy tasks. Exposing everything through the options of a single transformer is a more user-friendly style: it lays out the valid choices and encourages users to explore them.

@mayoatte (Author) commented Nov 21, 2019

@justinormont Thank you for the fast & detailed response! I really appreciate it.

This works (thanks again) and avoids the variable-length vector issue I was having. I wanted to ask a follow-up question about whether my approach (with your addition) is really the best way of integrating the "cart" (list of item IDs) feature into the training.

For some context:

I'm building a recommendation engine that recommends k upsell items to a user at checkout time. My high-level approach is to frame this as a classification problem at the individual upsell item level, i.e. will this upsell item be purchased given the contents of the cart. At prediction time, I loop through all the potential upsell items (a small list) and predict whether each will be purchased with the cart, then recommend the top k upsell items.
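A sketch of that prediction-time loop, for concreteness. UpsellInput, UpsellPrediction, and the column names here are hypothetical, not from the thread:

```csharp
// Score every candidate upsell item against the current cart and keep the top k.
var engine = mlContext.Model.CreatePredictionEngine<UpsellInput, UpsellPrediction>(model);

var topK = candidateUpsellItems
    .Select(item => (Item: item,
                     Score: engine.Predict(new UpsellInput
                     {
                         ItemIDs = cartItemIds,     // e.g. "3499430,3499435"
                         UpsellItemId = item
                     }).Probability))
    .OrderByDescending(x => x.Score)
    .Take(k)
    .Select(x => x.Item)
    .ToList();
```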

To train, I'm using the FieldAwareFactorizationMachine.

Now some questions:

I want the model to learn the interactions between the upsell item and cart items both individually (e.g. upsell item 3 to cart item 5) and collectively (upsell item 4 to cart items 6, 8, and 9).
  • Is this actually a viable approach with the FieldAwareFactorizationMachine? If not, do you recommend any alternatives?
  • My initial plan for the cart was to create a vector that represented every possible cart item and use that to represent the cart, but I had concerns about the size (thousands of cart items). What do you think of this approach? Does the ngram approach serve as a suitable substitute?
  • Any other thoughts?

Thanks again for the help.

@justinormont (Member) commented Nov 21, 2019

... is really the best way of integrating the "cart" (list of itemids) feature into the training.

Treating the ItemIDs as text is a good method. There are always more things to try.

AutoML:
I'd recommend trying AutoML to choose the model for your classification. If you change your ItemIDs column to be space-separated, AutoML will pick it up and featurize it as text. You can also add your featurizer code as a pre-featurizer/preprocessor for AutoML.

Recommendation task:
You may want to run as a recommendation model instead of classification. ML.NET's AutoML recently added support for this. Posing as a ranking task may be fruitful too.


Answers to questions:

Is this actually a viable approach with the FieldAwareFactorizationMachine? If not, do you recommend any alternatives?

Yes, it should work. Along with the recommendation task suggested above, you could also train a multi-class model. When creating the training set, I assume you're creating it leave-one-out style, where you remove one of the purchased items and make it the label (or, for binary classification, the extra feature). Use a SamplingKeyColumn so that duplicated/replicated rows from the same customer/purchase end up in the same dataset split (otherwise leakage can occur).
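As a sketch, the split might look like this ("CustomerId" is an illustrative column name):

```csharp
// Rows sharing a CustomerId land on the same side of the split,
// preventing leakage between train and test.
var split = mlContext.Data.TrainTestSplit(
    data, testFraction: 0.2, samplingKeyColumnName: "CustomerId");
var trainSet = split.TrainSet;
var testSet = split.TestSet;
```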

My initial plan for the cart was to create a vector that represented every possible cart item and use that to represent the cart, but I had concerns about the size (thousands of cart items). What do you think of this approach? Does the ngram approach serve as a suitable substitute?

The ngram-based approach does this work for you and will produce a feature slot for every item. The size is not an issue, though you can try SelectFeaturesBasedOnCount() to remove infrequently purchased items.
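A sketch of that pruning step (the column name and the threshold of 10 are illustrative):

```csharp
// Drop feature slots (item IDs) seen fewer than 10 times in the training data.
var pruning = mlContext.Transforms.FeatureSelection
    .SelectFeaturesBasedOnCount("FeaturesText", count: 10);
```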

Any other thoughts?

I'd start w/ AutoML:
https://dotnet.microsoft.com/learn/ml-dotnet/get-started-tutorial/intro

Plenty of options to try. For instance: train a word embedding model in fastText, then bring it into the Word Embedding transform. This would learn the relationships (semantic meaning) between the ItemIDs instead of treating them as opaque tokens.
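A sketch of wiring a custom fastText model into the Word Embedding transform (the file path and column names are illustrative; the vectors would be trained offline on historical carts):

```csharp
// Tokenize the comma-separated IDs, then look each token up in a
// custom (fastText-trained) word-embedding file.
var embedding = mlContext.Transforms.Text.TokenizeIntoWords(
        "Tokens", "ItemIDs", separators: new[] { ',' })
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding(
        "Features", @"itemid_vectors.txt", "Tokens"));
```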

You may also want to expand your input features:

  • Item features: item title, description, category of each item purchased, cost, popularity
  • Purchase features: date (hour of the day, day of the week, etc), number of items purchased, average/stddev of cost in the basket
  • User history features: average/stddev user item cost, number of purchases in the last month, user location, number of times user purchased each item previously, duration since last purchase of the item

@mayoatte (Author) commented Nov 21, 2019

@justinormont Thank you so much, Justin! This is a wealth of information.

I'll be sure to check out AutoML; I'm glad it now supports recommendation.

re: answers

I assume you're creating it leave-one-out style, where you remove one of the purchased items and make it the label (or, for binary classification, the extra feature)

Yes, that's what we do for binary classification. There's a column for the upsell item that was purchased and another column for the other items that were purchased (the cart). The label is driven by the upsell item purchased.

Ensure you use a SamplingKeyColumn to ensure that duplicated/replicated rows from the same customer/purchase end up in the same dataset split (or leakage can occur)

We do use the SamplingKeyColumn, but we only set it to the customer ID. Do we need to go lower (purchase level), or is that an either/or choice?

The ngram-based approach does this work for you and will produce a feature slot for every item

That's great news.

Thank you for the other suggestions as well!

@justinormont (Member) commented Nov 21, 2019

We do use the SamplingKeyColumn, but we only set it to the customer ID. Do we need to go lower (purchase level), or is that an either/or choice?

Basing it on the CustomerID is a stronger and better method than the PurchaseID. Nice idea.

Feel free to follow-up with which techniques showed value.

@ashbhandare (Contributor) commented Nov 25, 2019

@mayoatte If you found what you were looking for, are we good to close this issue?

@mayoatte (Author) commented Nov 26, 2019

@ashbhandare I did find what I was looking for. Thanks to you and Justin for all the help. Will close the issue.

mayoatte closed this issue Nov 26, 2019