[94] Support for strings #97
What does this PR enable? Loading images into TensorFrames? Can you make the description a little more descriptive :-)
}

// ********** STRING *********
// This is actually byte arrays, which corresponds to the 'binary' type in Spark.
What if you pass in a real string column? Wouldn't it be cleaner to explicitly pass in byte arrays?
The String type in Spark is different: it holds textual data (stored as its UTF-8 encoded representation), and it would not be accepted here. The reason is that converting text to a byte array requires specifying an encoding (UTF-8, UTF-16, etc.). To sidestep this problem, TensorFlow only accepts byte arrays (which it calls 'strings').
Under the hood, though, Spark passes byte arrays to this function (see appendRaw below).
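To illustrate the encoding point above, here is a minimal sketch in plain Scala (no Spark or TensorFlow involved). The object `EncodingSketch` and the helper `toBytes` are hypothetical names used only for this example; they are not part of the PR.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object EncodingSketch {
  // Hypothetical helper: the caller must choose the charset explicitly,
  // because the same text maps to different byte arrays per encoding.
  def toBytes(s: String, charset: Charset): Array[Byte] = s.getBytes(charset)

  // "h\u00e9llo" is "hello" with an accented 'e' (5 characters).
  val text: String = "h\u00e9llo"

  // UTF-8 uses 2 bytes for the accented character and 1 byte for each other one.
  val utf8: Array[Byte] = toBytes(text, StandardCharsets.UTF_8)

  // UTF-16BE uses 2 bytes for each of these characters and writes no byte-order mark.
  val utf16: Array[Byte] = toBytes(text, StandardCharsets.UTF_16BE)
}
```

The two arrays have different lengths and contents for the same text, which is why a byte-oriented column is unambiguous while a text column is not.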
override def convertTensor(t: tf.Tensor): MWrappedArray[Array[Byte]] = {
  // TODO(tjh) implement later
  ???
What's happening here?
1/ Does it compile?
2/ Are these not needed?
Oh yes, `???` is a special Scala placeholder for holes in the program: it compiles, and it throws a `NotImplementedError` if it is ever called. It should be replaced by proper exceptions.
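As a rough sketch of the difference between a `???` hole and an explicit exception: the object `HoleSketch`, both method names, and the error message below are hypothetical and only loosely modeled on `convertTensor`; the simplified signatures avoid any TensorFlow or Spark types.

```scala
object HoleSketch {
  // `???` is a method in scala.Predef of type Nothing, so this type-checks
  // against any return type and only fails when the method is called.
  def withHole(bytes: Array[Byte]): Seq[Array[Byte]] = ???

  // Assumed replacement: an explicit exception with a descriptive message.
  def withException(bytes: Array[Byte]): Seq[Array[Byte]] =
    throw new UnsupportedOperationException(
      "Converting binary tensors is not implemented yet")

  // Runs `body` and returns whatever it threw, if anything.
  def failureOf(body: => Any): Option[Throwable] =
    try { body; None } catch { case t: Throwable => Some(t) }

  val holeFailure: Option[Throwable] = failureOf(withHole(Array.empty[Byte]))
  val explicitFailure: Option[Throwable] = failureOf(withException(Array.empty[Byte]))
}
```

Both versions compile; the explicit exception simply gives callers a message that says what is missing.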
Merging this PR, thanks @sueann for taking a look!
Adds support for ingesting byte arrays (the `binary` type in Spark, which corresponds to the `string` type in TensorFlow).
This also refactors some of the internals so that the dependencies on TensorFlow and Spark are confined to the encoding and decoding code. This helps significantly when checking the correctness of the code.
Includes an integration test that reads a binary string in Spark.
#94