Conversation

@thunterdb (Contributor) commented Apr 18, 2017

Adds support for ingesting byte arrays (the binary type in Spark, which corresponds to the string type in TensorFlow).

This also refactors some of the internals so that the dependencies on TensorFlow and Spark are confined to the encoding and decoding code. This helps significantly when checking the code for correctness.

Includes an integration test that reads a binary string in Spark.
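
For illustration (a minimal sketch, not code from this PR), the kind of input this enables is a DataFrame with a Spark binary column, which maps to TensorFlow's string dtype:

```scala
// Sketch only: a DataFrame with a 'binary' column of the kind TensorFrames
// can now ingest as TF 'string' tensors.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("binary-ingest-demo").getOrCreate()
import spark.implicits._

// Each row carries a raw byte array (e.g. the bytes of an encoded image).
val df = Seq(
  Array[Byte](1, 2, 3),
  Array[Byte](4, 5)
).toDF("payload")

df.printSchema()  // payload: binary
```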

#94

@thunterdb thunterdb added this to the Release 0.2.7 milestone Apr 18, 2017
@sueann commented Apr 18, 2017

What does this PR enable? Loading images into TensorFrames? Can you make the description a little more descriptive? :-)

@sueann sueann self-requested a review April 18, 2017 23:32
}

// ********** STRING *********
// These are actually byte arrays, which correspond to the 'binary' type in Spark.

What if you pass in a real string column? Wouldn't it be cleaner to explicitly pass in byte arrays?

Contributor Author:

The String type in Spark is different: it represents textual data (UTF-8 encoded internally), and it would not be accepted here. The reason is that you would need to specify an encoding (UTF-8, UTF-16, etc.) in order to convert it to a byte array. To sidestep this problem, TensorFlow only accepts byte arrays (which it calls 'strings').

Under the hood, though, Spark passes byte arrays to this function (see appendRaw below).
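
For concreteness (my sketch, not code from this PR): if you start from a Spark string column, you have to pick the encoding yourself and convert it to binary first, e.g. with Spark's built-in encode function. The `df` and its `text` column below are hypothetical.

```scala
// Sketch: converting a StringType column to BinaryType with an explicit charset.
import org.apache.spark.sql.functions.{col, encode}

val withBytes = df.withColumn("payload", encode(col("text"), "UTF-8"))
// 'payload' is now a binary column (one byte array per row), which is what
// TensorFlow's 'string' dtype expects.
```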


override def convertTensor(t: tf.Tensor): MWrappedArray[Array[Byte]] = {
  // TODO(tjh) implement later
  ???

What's happening here?
1/ Does it compile?
2/ Are these not needed?

Contributor Author:

Oh yes, `???` is special Scala syntax for holes in the program (it throws a NotImplementedError when evaluated). It should be replaced by proper exceptions.
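
As a sketch of that follow-up (my suggestion, not code from this PR), the hole in the method under review could be replaced with an explicit exception; `tf.Tensor` and `MWrappedArray` are the types already used in this file:

```scala
// Sketch: replace the `???` hole with an explicit exception so callers get a
// descriptive message instead of a bare scala.NotImplementedError.
override def convertTensor(t: tf.Tensor): MWrappedArray[Array[Byte]] = {
  throw new UnsupportedOperationException(
    "Converting TensorFlow string tensors back to byte arrays is not implemented yet")
}
```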

@thunterdb (Contributor, Author) commented:
Merging this PR, thanks @sueann for taking a look!

@thunterdb thunterdb merged commit b393bf3 into databricks:master Apr 24, 2017