Investigate using text and sparse input in TensorFlow #747

Closed
yaeldekel opened this issue Aug 27, 2018 · 6 comments

@yaeldekel
Member

commented Aug 27, 2018

We should know how TF handles text inputs, and whether it supports sparse inputs.

@yaeldekel added this to To Do in DNN and ONNX scoring in ML.NET via automation on Aug 27, 2018

@shauheen added this to To Do in v0.6 on Aug 31, 2018

@zeahmed self-assigned this on Sep 6, 2018

@zeahmed moved this from To Do to In Progress in DNN and ONNX scoring in ML.NET on Sep 6, 2018

@zeahmed

Member

commented Sep 12, 2018

Most text models in TensorFlow (and in DNN platforms in general) use an embedding layer to handle text. This contrasts with the bag-of-words approach, in which a vector is formed over the words/characters in the text: the indices of the vector refer to the words/characters, and the values hold the TF, TF-IDF, or other scores computed for them.

The bag-of-words model requires vectors to be represented in sparse format, because the number of distinct words/characters that can appear in text is very large while any single text contains only a few of them. Models with embedding layers do not need a sparse format, because the input to an embedding layer is typically not that large. A dense format is therefore sufficient.
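To make the contrast concrete, here is a minimal sketch in plain Python (used only for illustration; it is not ML.NET or TensorFlow code). The toy vocabulary and both helper functions are hypothetical. It shows the same sentence as a sparse bag-of-words vector and as the short dense sequence of integer indices that an embedding layer would typically take.

```python
from collections import Counter

# Hypothetical toy vocabulary; real vocabularies are far larger,
# which is why bag-of-words vectors are stored sparsely.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5}


def bag_of_words_sparse(tokens, vocab):
    """Sparse term-frequency vector as {vocabulary index: count}.

    The full vector has len(vocab) slots; only non-zero slots are stored.
    """
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return dict(sorted(counts.items()))


def index_sequence(tokens, vocab):
    """Dense sequence of vocabulary indices, one entry per token.

    Its length depends on the text length, not on the vocabulary size.
    """
    return [vocab[t] for t in tokens if t in vocab]


tokens = "the cat sat on the mat".split()
sparse_bow = bag_of_words_sparse(tokens, vocab)
dense_ids = index_sequence(tokens, vocab)
```

The bag-of-words vector grows with the vocabulary, while the index sequence grows only with the text, which is why a dense representation is workable for embedding-based models.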

However, when working with text models, I found the following issues.

  • String input/output is not supported at all in TensorFlowTransform. TensorFlowSharp also has limited functionality in this regard.
  • Models that are not based on string inputs consist of two sets of resources:
    1. Model file
    2. Text resources, such as a dictionary that converts text items (words, characters) into a vector of integers.
  • The conversion is a pre-processing step, so we need to find a way to convert text items into a vector of integers. I tried TermLookupTransform and TermTransform; neither worked.
  • For models that accept fixed-length text input, we need to find a way to trim and pad vectors so that an appropriately sized vector can be passed to TensorFlow. Variable-sized inputs should not be an issue.
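The pre-processing described in the list above (dictionary lookup, then trimming and padding to a fixed length) can be sketched as follows. This is plain Python for illustration only, not an ML.NET transform. The reserved ids `PAD_ID` and `UNK_ID`, the one-token-per-line dictionary format, and all function names are assumptions; a real model ships its own dictionary and conventions.

```python
PAD_ID = 0  # assumed id used for padding
UNK_ID = 1  # assumed id for tokens missing from the dictionary


def load_dictionary(lines):
    """Map each token to an integer id (hypothetical format: one token per line).

    Ids start after the two reserved values above.
    """
    tokens = [line.strip() for line in lines if line.strip()]
    return {tok: i + 2 for i, tok in enumerate(tokens)}


def encode(text, token_to_id):
    """Convert text into a vector of integers via dictionary lookup."""
    return [token_to_id.get(tok, UNK_ID) for tok in text.lower().split()]


def pad_or_trim(ids, length, pad_id=PAD_ID):
    """Trim to `length`, or right-pad with `pad_id` up to `length`."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


token_to_id = load_dictionary(["good", "movie", "bad"])
ids = encode("Good movie overall", token_to_id)
padded = pad_or_trim(ids, 5)
trimmed = pad_or_trim(ids, 2)
```

Whether padding goes on the left or the right, and which id represents it, depends on how the model was trained, so those choices should be taken from the model's own documentation rather than from this sketch.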

I currently don't see any issue with retrieving outputs from TensorFlow. I will write more if I encounter other issues.

@zeahmed

Member

commented Sep 12, 2018

The TermLookupTransform does not seem to operate on vectors, while TermTransform outputs a Key type, which TensorFlowTransform currently cannot consume.

@shauheen removed this from To Do in v0.6 on Sep 25, 2018

@shauheen added this to To Do in Backlog via automation on Nov 27, 2018

@asthana86

Collaborator

commented Jan 15, 2019

This issue currently blocks the UI tooling for ML.NET. Can it be addressed sooner to unblock the tooling work?

@zeahmed

Member

commented Jan 15, 2019

@asthana86, this issue cannot be closed yet because it requires a few more features to be developed in ML.NET, such as a padding and trimming transform. Can you please let me know how this is blocking the tooling work? I may be able to help unblock you.

@zeahmed removed this from In Progress in DNN and ONNX scoring in ML.NET on Jan 16, 2019

@zeahmed added the enhancement label on Jan 16, 2019

@zeahmed added this to In Progress in v0.10 on Jan 16, 2019

@zeahmed moved this from In Progress to To Do in v0.10 on Jan 16, 2019

@shauheen removed this from To Do in v0.10 on Jan 28, 2019

@shauheen added this to To do in v0.11 on Jan 28, 2019

@Ivanidzo4ka

Member

commented Feb 14, 2019

Text usage would be handled here: #2545

@shauheen removed this from To do in v0.11 on Mar 6, 2019

@zeahmed

Member

commented May 22, 2019

The following examples show use cases for text classification and string input/output. Most of the points raised in this issue are now covered, so I am closing it.

@zeahmed closed this on May 22, 2019
