
Make python DeepImageFeaturizer use Scala version. #88

Merged
merged 13 commits into databricks:master from tomas/ML_3150 on Jan 23, 2018

Conversation

tomasatdatabricks (author):

  • Based on the Image schema PR; do not merge until the Image schema PR is merged.
  • Otherwise mostly straightforward, except that results will not match Keras in general due to different image libraries.

@sueann (Collaborator) left a comment:

Here is an initial pass. In the future, could you fix up the obvious style issues before requesting review so we don't spend as much time on them (if it'll help, we can invest in putting a linter into this repo), and pay a bit more attention to readability? Thanks!

Also let's add @MrBago as a reviewer since he is the most familiar with these parts.

@@ -117,12 +117,10 @@ def getInputTensor(self):
def getOutputTensor(self):
tensor_name = self.getOrDefault(self.outputTensor)
return self.getGraph().get_tensor_by_name(tensor_name)

Collaborator:

Can we keep these newlines between functions?

Collaborator:

In fact, can you remove the changes in this file, since they are not needed and irrelevant to the PR? (Sorry if I'm seeing weird changes because I'm using GitHub wrong.)

@@ -235,3 +216,45 @@ def _buildTFGraphForName(name, featurize):
modelData["graph"] = graph

return modelData

class PyDeepImageFeaturizer(Transformer, HasInputCol, HasOutputCol):
Collaborator:

Is this for testing? If so, can you comment and make it really clear it's not for normal usage?

Collaborator (author):

This is still WIP; I have not decided what to do with the Python version of the featurizer. I can mark it as test-only.

@@ -182,7 +182,7 @@ def test_featurization(self):
Tests that featurizer returns (almost) the same values as Keras.
"""
output_col = "prediction"
transformer = DeepImageFeaturizer(inputCol="image", outputCol=output_col,
transformer = PyDeepImageFeaturizer(inputCol="image", outputCol=output_col,
Collaborator:

This doesn't test what we need to anymore. Let's discuss offline.

@@ -215,3 +215,24 @@ def test_featurizer_in_pipeline(self):
pred_df_collected = lrModel.transform(train_df).collect()
for row in pred_df_collected:
self.assertEqual(int(row.prediction), row.label)

def test_scala_vs_py(self):
Collaborator:

Can you make this function more readable by breaking down long lines and grouping lines into sections? I will review it after 😂

@@ -0,0 +1,105 @@

Collaborator:

add license info here (see other files)

Contributor:

+1

Collaborator:

we still need the top-level license for the file (in addition to the model-specific ones you have already put in below).

from hashlib import sha256
from base64 import b64encode

def gen_model(name,model, model_file, version=1,featurize=True):
Collaborator:

space after 1,

Collaborator:

can you fix the spacing everywhere in this file

return '/Users/tomas/dev/spark-deep-learning/python/tests/resources/images'


def test_scala_vs_py():
Collaborator:

please make this also more readable

Collaborator:

Is this the same test as what's in the Python portion? How/when is this meant to be called?

"""


from hashlib import sha256
Collaborator:

put all imports at the top of file

model_file.write(scala_template %{"name":name,"height":model.inputShape()[0],"width":model.inputShape()[1],"version":version,"base64":base64_hash})
return g2

import os
Collaborator:

import up top

@@ -38,6 +38,7 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def

final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
final val scaleFast: Param[Boolean] = new Param[Boolean](this,"scaleFast","use fast resizing if set.")
Collaborator:

what does "fast resizing" mean?

Collaborator (author):

It's just a flag signaling to the resizer to use the other resize method. Not sure how to name it; it's not mapped to a single flag on the Java side either. SCALE_FAST or SCALE_DEFAULT would work, probably a few others as well. The name is WIP.

@tomasatdatabricks tomasatdatabricks force-pushed the tomas/ML_3150 branch 5 times, most recently from d2c8562 to 24f3290 on December 19, 2017 19:47
codecov-io commented Dec 19, 2017

Codecov Report

Merging #88 into master will decrease coverage by 0.04%.
The diff coverage is 86.58%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #88      +/-   ##
==========================================
- Coverage   82.49%   82.44%   -0.05%     
==========================================
  Files          33       34       +1     
  Lines        1879     1937      +58     
  Branches       35       36       +1     
==========================================
+ Hits         1550     1597      +47     
- Misses        329      340      +11
Impacted Files Coverage Δ
python/sparkdl/transformers/tf_image.py 94.11% <ø> (ø) ⬆️
...main/scala/com/databricks/sparkdl/ImageUtils.scala 90.9% <100%> (ø) ⬆️
src/main/scala/com/databricks/sparkdl/Models.scala 85% <85%> (ø)
...a/com/databricks/sparkdl/DeepImageFeaturizer.scala 93.65% <86.66%> (-1.74%) ⬇️
python/sparkdl/transformers/named_image.py 91.4% <88.46%> (-2.12%) ⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@MrBago (Contributor) left a comment:

I have two high-level concerns about this PR.

First, the auto-generation of the Models.scala file seems brittle, and that code doesn't have any unit tests. It seems very likely that we'll need to change it the next time someone needs to run it. For example, if we need to update a single model, the script will roll all the version numbers, so we will either need to upload a bunch of unchanged models or manually mess with the version numbers. My suggestion is that we isolate this code, put it somewhere outside the python packages (executable scripts shouldn't go in the package anyway), and formalize the process for adding new models in a follow-up task.

Second, the testing in DLP is a total beast and we're making it worse. This is a pretty small package and our CI tests take something like 4 hours, is that right? I can't get the tests to pass on my local machine, and I've been trying. This PR now requires downloading all the models in order to run the tests. I think we should think about how we can test the correctness of the Python code that runs the TensorFlow featurizers without running every single TensorFlow featurizer.

We should also test the graph files we export, but the hashes will ensure those don't change, so we don't need to test that every time we run the test suite.
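For context, a minimal sketch of the hash mechanism being referenced. This is an inference from the sha256/b64encode imports in the generator script shown below, so the helper name and exact wiring are assumptions:

from base64 import b64encode
from hashlib import sha256

def graph_hash(graph_def_bytes):
    # Base64-encoded sha256 of the serialized TF graph. Embedding this value
    # in Models.scala means a silently modified .pb file fails at load time.
    return b64encode(sha256(graph_def_bytes).digest())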

to the image column in DataFrame. The output is a MLlib Vector so that DeepImageFeaturizer
can be used in a MLlib Pipeline.
The input image column should be 3-channel SpImage.
"""
Contributor:

Why are we removing the doc string? I think all public classes should have docstrings.

scalaFeaturizer.setModelName(self.getOrDefault(self.modelName))
scalaFeaturizer.setInputCol(self.getOrDefault(self.inputCol))
scalaFeaturizer.setOutputCol(self.getOrDefault(self.outputCol))
if(self.isDefined(self.scaleHint)):
Contributor:

We don't usually put () around the if condition in Python, e.g.:

if condition:
    pass

scalaFeaturizer.setInputCol(self.getOrDefault(self.inputCol))
scalaFeaturizer.setOutputCol(self.getOrDefault(self.outputCol))
if(self.isDefined(self.scaleHint)):
scalaFeaturizer.setResizeFlag(self.getOrDefault(self.scaleHint))
Contributor:

Why is scaleHint treated differently than the other params?

Collaborator (author):

scaleHint is optional; (input/output)Col are not.
The isDefined call is so that we don't override Scala's default.
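In other words, a minimal sketch of the guard being described (using the setScaleHint name settled on later in this review):

scalaFeaturizer.setInputCol(self.getOrDefault(self.inputCol))
scalaFeaturizer.setOutputCol(self.getOrDefault(self.outputCol))
# Transfer scaleHint only when the user actually set it, so the
# Scala-side default is left untouched otherwise.
if self.isDefined(self.scaleHint):
    scalaFeaturizer.setScaleHint(self.getOrDefault(self.scaleHint))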

return self._set(modelName=value)

def getModelName(self):
return self.getOrDefault(self.modelName)

def _transform(self, dataset):
Contributor:

It would be nice if we inherited from JavaTransformer and used the default _transform implementation here. Was that considered?

Collaborator (author):

I was not aware of JavaTransformer. I'll look into it.
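For reference, a rough sketch of what the JavaTransformer route could look like. Assumptions: the Scala class is com.databricks.sparkdl.DeepImageFeaturizer (as in this PR), and the modelName/scaleHint Params are omitted for brevity:

from pyspark import keyword_only
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.wrapper import JavaTransformer

class DeepImageFeaturizer(JavaTransformer, HasInputCol, HasOutputCol):
    # JavaTransformer's default _transform transfers the Python-side params
    # to the wrapped JVM object and calls its transform(), so no custom
    # _transform implementation is needed.
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(DeepImageFeaturizer, self).__init__()
        self._java_obj = self._new_java_obj(
            "com.databricks.sparkdl.DeepImageFeaturizer", self.uid)
        kwargs = self._input_kwargs
        self._set(**kwargs)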

outTensor = tf.to_double(tf.reshape(m.output, [-1]), name="%s_sparkdl_output__" % name)
gdef = tfx.strip_and_freeze_until([outTensor], session.graph, session, False)
g2 = tf.Graph()
with g2.as_default():
Contributor:

Do we really need nested `with graph.as_default():` blocks? The semantics of this aren't clear to me.

Collaborator (author):

Yes, we do need a new graph into which we read the definition; otherwise it imports into the existing graph.
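A minimal illustration of the point, using the TF 1.x API (gdef stands in for the frozen GraphDef produced earlier):

import tensorflow as tf

g2 = tf.Graph()
with g2.as_default():
    # Imports gdef into the fresh graph g2. Without this block, the nodes
    # would be merged into whatever graph is currently the default.
    tf.import_graph_def(gdef, name='')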

@@ -42,7 +42,7 @@ class PythonUnitTestCase(unittest.TestCase):
class TestSparkContext(object):
@classmethod
def setup_env(cls):
cls.sc = SparkContext('local[*]', cls.__name__)
cls.sc = SparkContext.getOrCreate()
Contributor:

Why this change? I think TestSparkContext is expected to create a new context for each test suite.

Collaborator (author):

No, I checked with Philip, who is the author of that line, and it is fine this way.
The reason is that I added a test which runs prior to this one and which creates a Spark context, in which case this test fails with too many Spark contexts.
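A sketch of the behavior difference (standard PySpark API):

from pyspark import SparkContext

# SparkContext('local[*]', name) always constructs a new context and raises
# an error if one is already running in the JVM; getOrCreate() reuses the
# running context, so test suites sharing a process don't collide.
sc = SparkContext.getOrCreate()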

Contributor:

I think right now we run each individual test file separately (https://github.com/databricks/spark-deep-learning/blob/master/python/run-tests.sh#L100-L109) in our Travis CI.
(Correct me if I am wrong.) I think creating new ones or not shouldn't matter.

Collaborator (author):

It matters when you run the tests locally

Contributor:

^ I agree. When testing locally and running multiple tests simultaneously, there will be conflicts. What I was trying to say is that I support the way you are changing the Spark context.

Contributor:

Hmm ... I don't think you can define the params in the constructor (I'd like to know if you can) because I think the HasParams class uses introspection on the class to discover all the params. We might be able to validate the param values using a function that caches the Java content the first time it's run.

Contributor:

Wait ... does the validator on the Scala Param not get run until transform is called? I might look into fixing that after this release.

Collaborator (author):

The params are not transferred to scala until you call _transform. At least that's how it looks to me after a quick look. I can transfer them eagerly in setParams.

Collaborator (author):

Oh, both work. I can either:

  1. leave validation to Scala and eagerly transfer params, or
  2. add a converter with lazy initialization.

Contributor:

I think within MLlib lazy initialization is mostly used; I'm not sure why that's the case.
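A sketch of option 2, the lazily initialized converter. The _getScaleHints helper is hypothetical, standing in for the scaleHintsJava() lookup that appears later in this PR:

class _LazyScaleHintConverter(object):
    """Validates a scaleHint value against the JVM-side list of hints,
    fetched on first use so no JVM is required at import time."""
    _hints = None

    def __call__(self, value):
        if self._hints is None:
            # Hypothetical helper, e.g. dict(featurizer.scaleHintsJava()).keys()
            self._hints = set(_getScaleHints())
        if value not in self._hints:
            raise TypeError("Invalid scale hint: %r" % value)
        return value

_scaleHintConverter = _LazyScaleHintConverter()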

imageDf = imageDf.coalesce(self.numPartitionsOverride)

transformer = DeepImageFeaturizer(inputCol='image', modelName=self.name,
outputCol="features")
Contributor:

alignment

@@ -147,7 +162,7 @@ object DeepImageFeaturizer extends DefaultParamsReadable[DeepImageFeaturizer] {

// TODO: support batched graphs with mapBlocks

private[sparkdl] trait NamedImageModel {
protected[sparkdl] trait NamedImageModel {
Contributor:

Why this change? Do we expect this to be changed in subclasses?

Collaborator (author):

I needed to access it from the Models.scala file

Contributor:

I think private[sparkdl] means you can access it from within the sparkdl package.

Collaborator (author):

True, good catch

@@ -95,14 +104,20 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def
this
}

def setResizeFlag(value: String): this.type = {
set(scaleHint, value)
Contributor:

I think we always use the naming convention setX for param X. Could this be renamed to setScaleHint, or could we rename the param to resizeFlag?

Collaborator (author):

Yes, I've renamed resizeFlag to scaleHint and forgot to change the method. Good catch

@@ -113,13 +115,15 @@ private[sparkdl] object ImageUtils {
* @param tgtChannels number of channels of output image (must be 3), may be used later to
* support more channels.
* @param spImage image to resize.
* @param scaleHint hint which algorhitm to use, see java.awt.Image#SCALE_DEFAULT
Contributor:

This suggests that our default is SCALE_DEFAULT, but I don't think it is.

Contributor:

btw, I think SCALE_DEFAULT might be implementation specific for java.awt.Image subclasses.

Collaborator (author):

Hmm, I picked one at random; it is not supposed to suggest what the default value is. I can replace it with the default.

tomasatdatabricks (author):

As for the high-level comments:

  1. Sure, you might need to change the script. The aim here is not to make a script which will work without change forever; it's the script which was used for generating the models/graphs for this version, that's it. If you make changes, you might need to change the script. Leaving it in the same repo is mostly for convenience. I really don't see a problem here. I can add comments clarifying this.

  2. Regarding the tests, I agree the tests run for too long. But are you saying this PR adds significantly to the test runtime? The tests seem to be running for pretty much the same time. I think this has been an issue before, so a test overhaul belongs in a separate PR.



def gen_model(name, model, model_file, version=1, featurize=True):
g = tf.Graph()
Collaborator:

Simpler to read, and it's clear that there is no scope overlap:

g = tf.Graph()
with tf.Session(graph=g) as sess:
  ...
  gdef = tfx.strip_and_freeze_until([outTensor], sess.graph, sess, return_graph=False)
g2 = tf.Graph()
with tf.Session(graph=g2) as sess:    # I believe `with g2.as_default():` instead of initializing a session would also work here, but I'd do it outside of the previous session for clarity.
  tf.import_graph_def(gdef, name='')
  filename = "sparkdl-%s_v%d.pb" % (name, version)
  print 'writing out ', filename
  tf.train.write_graph(g2.as_graph_def(), logdir="./", name=filename, as_text=False)
# I'd return here and deal with the scala file writing elsewhere, but if you want to put anything more, you can do it outside the second session.

Collaborator (author):

Yeah that's way cleaner, thanks.


MrBago commented Dec 20, 2017

One more thing: I know that we've already talked about style, but I don't know if the linter will catch this. We should be consistent about using camelCase as much as possible. And also let's try to be consistent with indentation:

scala(
  lines,
  wrap,
  likeThis)

python(wraps, like,
       this)

weShouldNot(use,
            this,
            hybrid,
            style)

(We had a conversation about adopting a single indentation style for the whole project back in June, and while I think there were some good arguments for that, we decided it would be best to stick to the predominant style in each of the two languages.)

tomasatdatabricks (author):

Ok, I've pushed an updated version.

Biggest change: I've changed gen_app_models.py to no longer modify Models.scala directly.
It now generates a Models.scala.generated file in the working directory and lets the user handle the rest.

@smurching smurching self-requested a review December 21, 2017 01:04
@smurching (Collaborator) left a comment:

Nice work, left a few comments

kerasPredict = self.kerasPredict

def rowWithImage(img):
# return [imageIO.imageArrayToStruct(img.astype('uint8'), imageType.sparkMode)]
Collaborator:

Remove this line?

def rowWithImage(img):
# return [imageIO.imageArrayToStruct(img.astype('uint8'), imageType.sparkMode)]
row = imageIO.imageArrayToStruct(img.astype('uint8'))
# re-order row to avoid pyspark bug
Collaborator:

Optional: if there's a JIRA ticket for the bug, note the ticket name (e.g. "SPARK-xxxx") in this comment?

@@ -38,6 +41,9 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def

final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
final val scaleHint: Param[String] = new Param(this,"scaleHint","hint which method to use for resizing.",
Collaborator:

Nit: Spaces after commas 🙃

@@ -95,14 +104,20 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def
this
}

def getScaleHint(value: String): this.type = {
Collaborator:

Typo: should this be setScaleHint (not getScaleHint)?

self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol=None, outputCol=None, modelName=None):
Collaborator:

I think we still need to implement setParams along with Python getters/setters for scaleHint and modelName. You can look at PySpark's Bucketizer implementation as an example: https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L318

Collaborator:

It's a little strange that our unit tests didn't catch this; in PySpark we run doctests that attempt to get/set params, but I guess we don't have equivalent tests in DLP. It would be good to add such tests at some point, probably in a future PR... cc @MrBago

Collaborator (author):

Ah ok, I did not know this was required. I saw it somewhere in the code but I did not see a use for it. I'll add it, thanks.
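For what it's worth, a minimal sketch of the setParams pattern the Bucketizer example uses; the scaleHint default follows this PR, but treat it as a sketch rather than the final code:

from pyspark import keyword_only

@keyword_only
def setParams(self, inputCol=None, outputCol=None, modelName=None,
              scaleHint="SCALE_AREA_AVERAGING"):
    """
    setParams(self, inputCol=None, outputCol=None, modelName=None, \
              scaleHint="SCALE_AREA_AVERAGING")
    Sets the params for this DeepImageFeaturizer.
    """
    # keyword_only stashes only the explicitly passed kwargs on the instance.
    kwargs = self._input_kwargs
    return self._set(**kwargs)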

@MrBago (Contributor) left a comment:

Does test_featurization_no_reshape require us to download all the model graphs in order to run our unit tests?

That's probably ok for now, but I think we should follow up soon with a task to restructure the testing a bit. It would be really nice to get the testing time down for our primary test suite.

@@ -0,0 +1,142 @@
#!/bin/python
Contributor:

The standard python shebang is #!/usr/bin/env python

@@ -0,0 +1,142 @@
#!/bin/python

Contributor:

Can we move this file to python/scripts/ or python/model_gen?

"""

modelName = Param(Params._dummy(), "modelName", "A deep learning model name",
typeConverter=SparkDLTypeConverters.buildSupportedItemConverter(SUPPORTED_MODELS))

scaleHint = Param(Params._dummy(), "scaleHint", "Hint which algorhitm to use for image resizing",
typeConverter=_scaleHintConverter)

Contributor:

Did you want to add getters and setters for scaleHint & modelName?

Collaborator (author):

Yeah, already added.
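For reference, the conventional getter/setter pair, mirroring the modelName accessors quoted earlier in this review:

def setScaleHint(self, value):
    return self._set(scaleHint=value)

def getScaleHint(self):
    return self.getOrDefault(self.scaleHint)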

modelName=self.getModelName(), featurize=True)
return transformer.transform(dataset)
# TODO: give an option to take off multiple layers so it can be used in tuning
# (could be the name of the layer or int for how many to take off).
Contributor:

This seems like kind of a big task for a source comment; should we maybe track this in an issue or JIRA instead?

@@ -86,7 +86,8 @@ def test_saved_graph_novar(self):

def gin_fun(session):
_build_saved_model(session, saved_model_dir)
return TFInputGraph.fromGraph(session.graph, session, [_tensor_input_name], [_tensor_output_name])
return TFInputGraph.fromGraph(session.graph, session, [
_tensor_input_name], [_tensor_output_name])
Contributor:

style: can we move [ to next line?

import org.tensorframes.impl.DebugRowOps
import org.tensorframes.{Shape, ShapeDescription}
Contributor:

Why the re-ordering of all the imports?

Collaborator (author):

Result of IDEA's Optimize Imports. I guess it puts them in alphabetical order.

Collaborator:

The original is the style we're using in the repo (following Databricks standards). You can setup your IntelliJ to use our convention: https://github.com/databricks/scala-style-guide#imports


import java.awt.image.BufferedImage
import java.awt.{Color, Image}

import com.sun.javafx.iio.ImageStorage.ImageType
import org.apache.spark.ml.image.ImageSchema
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.UserDefinedFunction
Contributor:

Let's drop the unused imports here:

awt, sql.functions, sql.expressions and com.sun.

def transform(dataFrame: Dataset[_]): DataFrame = {
validateSchema(dataFrame.schema)
val model = DeepImageFeaturizer.supportedModelMap(getModelName)

val imSchema = ImageSchema.columnSchema
val height = model.height
val width = model.width
val resizeUdf = udf((image: Row) => ImageUtils.resizeImage(height, width, 3, image), imSchema)

val resizeUdf = udf((image: Row) => ImageUtils.resizeImage(height, width, 3, image, DeepImageFeaturizer.scaleHints(getScaleHint)), imSchema)
Contributor:

length

kerasReshaped[i],
features_sc[i]) for i in range(
len(features_sc))]
np.testing.assert_array_almost_equal([0 for i in range(len(features_sc))], diffs, decimal=2)
Contributor:

Numpy lets you compare scalars to arrays:

np.testing.assert_array_almost_equal(diffs, 0., decimal=2)

@tomasatdatabricks tomasatdatabricks changed the title [WIP] Make python DeepImageFeaturizer use Scala version. Make python DeepImageFeaturizer use Scala version. Jan 4, 2018
@sueann (Collaborator) left a comment:

Looks generally good to me. Some changes requested (they shouldn't require much logic change).

Summarizing the tests for my sake, we have (please correct me if I'm wrong):

  • In Python, for all application models:
    • without resizing, Keras <-> DeepImageFeaturizer equivalence
    • with resizing, Keras <-> DeepImageFeaturizer proximity via cosine distance (+ a one-time test via transfer learning with DeepImageFeaturizer features)
  • In Scala, unit tests for DeepImageFeaturizer on a simple TF graph.

The Python tests are essentially integration tests for all the models in DeepImageFeaturizer.

#
# Takes keras models in sparkdl.transformers.keras_applications and prepends reshaping from ImageSchema
# and model specific preprocessing.
# Produces tensor flow model files and a scala file containing scala wrappers for all the models.
Collaborator:

nit: tensor flow -> TensorFlow

# 1. model *.pb files (need to be uploaded to S3) .
# 2. generated scala model wrappers Models.scala.generated (needs to be moved over to appropriate scala folder)
#
from base64 import b64encode
Collaborator:

insert space before

# and model specific preprocessing.
# Produces tensor flow model files and a scala file containing scala wrappers for all the models.
#
# Input: sparkdl.transformers.keras_aplications.KERAS_APPLICATION_MODELS
Collaborator:

keras_aplications -> keras_applications

This is not actually input by the user, right? It's read automatically by the script. Let's make that clear.

@@ -16,8 +16,13 @@
from keras.applications.imagenet_utils import decode_predictions
import numpy as np


Collaborator:

remove newline

return dict(featurizer.scaleHintsJava()).keys()


class _LazyCaleHintConverter:
Collaborator:

_LazyScaleHintConverter (typo) ?


'',
' private[sparkdl] object TestNet extends NamedImageModel {',
' /**',
' * A simple test graph used for testing DeepImageFeaturizer',
Collaborator:

indentation

}

/**
* Model provided by Keras. All cotributions by Keras are provided subject to the
Collaborator:

indent

Collaborator (author):

About the import reordering (GitHub does not let me respond above :/ )

Thanks, the IDEA setup helps! I think IDEA actually did the right thing in alphabetizing imports from third-party libraries. The only violation of the rules is the scala.* reference at the bottom.

}

/**
* Model provided by Keras. All cotributions by Keras are provided subject to the
Collaborator:

indent

Due to the differences between images resized by different libraries, DeepImageFeaturizer is no longer required to match the results from Keras on raw (non-resized) images.
Instead of computing the l-inf or l2 norm of the two feature vectors, we compare their cosine distance and require it to be sufficiently "low" (< 1e-2).
We also ran several transfer learning examples and ensured that the results were comparable.
These experiments were successful and the new sparkdl features proved to be at least as good as the native Keras ones; however, they have not been added as automated tests.

Overall I think the combination of (1) an exact match with no resize and (2) not too different with resize is good enough.

Cosine distance justification:
Cosine distance ensures that the resulting feature vector points in a similar direction. Intuitively, this is an important property for the generated features, and I think requiring the cosine distance to be low enough gives better guarantees than computing the l2 or l-inf norm and comparing it against a huge allowed diff.
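A sketch of the comparison being described, using scipy (which the tests in this PR import); keras_features and sparkdl_features are illustrative placeholders for the two feature vectors:

from scipy.spatial.distance import cosine

# cosine(u, v) = 1 - u.v / (||u|| * ||v||): 0 for identical directions,
# approaching 1 for unrelated ones.
dist = cosine(keras_features, sparkdl_features)
assert dist < 1e-2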

I did comparisons against different images, various amounts of added noise, and some obvious bugs I could think of, such as skipping the preprocessing or having the color channels flipped. Most distances came out orders of magnitude higher; noise with sd = 0.01 got a comparable distance. Here's the breakdown on the test images (the distance metric is by definition in the [0, 1] interval):

cosine distance per image to the same image with added (normal, mean = 0) noise:
sd = 1.00: [0.69, 0.78, 0.77, 0.75, 0.76]
sd = 0.10: [0.1, 0.2, 0.31, 0.12, 0.23]
sd = 0.01: [0.0078, 0.0094, 0.060, 0.0040, 0.0085]

cosine distance with no preprocessing: all ~ 0.9
cosine distance with faulty preprocess (mean of one channel is incorrect): all ~ 0.1
cosine distance with flipped channels: all ~ 0.3
cosine distance matrix for the test images:
[ 0.00 0.75 0.71 0.70 0.67]
[ 0.75 0.00 0.79 0.85 0.74]
[ 0.71 0.80 0.00 0.75 0.69]
[ 0.70 0.85 0.75 0.00 0.70]
[ 0.67 0.74 0.69 0.70 0.00]
…p_models.py script no longer modifies the Models.scala source directly. Instead it generates a file in the current working directory and lets the user copy it.
  Few minor fixes; the scaleHint converter in DeepImageFeaturizer is now lazy (and params are eagerly transferred to the JVM).
  Added licenses for the generated named model wrappers.
@sueann (Collaborator) left a comment:

just small comments. lgtm otherwise. thanks!

"name": name,
"height": model.inputShape()[0],
"width": model.inputShape()[1],
"version": version,
"base64": base64_hash})
"base64": base64_hash},2))
Collaborator:

space after ,

Collaborator:

i think this might need to be 1 instead of 2

Collaborator (author):

I think 2 is correct; it's the number of spaces for the indent, not the number of indent levels.

Collaborator:

Mm, not sure what's going on, but the generated file's indent doesn't look right:

/**
  * What's in Models.scala
  */
/**
 * Expected
 */

g = gen_model(name=name, model=modelConstructor(), model_file=f)
if not name in licenses:
raise KeyError("Missing license for model '%s'" % name )
g = gen_model(license = licenses[name],name=name, model=modelConstructor(), model_file=f)
Collaborator:

space after ,

"""
setParams(self, inputCol=None, outputCol=None, modelName=None, decodePredictions=False,
topK=5)
scaleHint="SCALE_AREA_AVERAGING",topK=5)
Collaborator:

space after ,

"""


def gen_model(name, model, model_file, version=1, featurize=True):
def indent(s, lvl):
return '\n'.join([' '*lvl + x for x in s.split('\n')])
Collaborator:

space around *


sueann commented Jan 18, 2018

Oh, and we need to re-run Travis.

"""
setParams(self, inputCol=None, outputCol=None, modelName=None)
setParams(self, inputCol=None, outputCol=None, modelName=None, decodePredictions=False,
Collaborator:

remove decodePredictions & topK (not sure if we talked about this; the comment here goes into the docs).

Collaborator (author):

oh yeah, that's a good catch, thanks

@@ -38,6 +39,9 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def

final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
final val scaleHint: Param[String] = new Param(this,"scaleHint", "hint which method to use for resizing.",
Collaborator:

space after this,

"name": name,
"height": model.inputShape()[0],
"width": model.inputShape()[1],
"version": version,
"base64": base64_hash})
"base64": base64_hash},2))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mm not sure what's going on but the generated file's indent doesn't look right:

/**
  * What's in Models.scala
  */
/**
 * Expected
 */

@@ -16,6 +16,8 @@
import numpy as np
import os

from scipy import spatial
Collaborator:

what's going on with the import ordering & grouping here 😂 (lines 16-25)

@sueann (Collaborator) left a comment:

just one comment fix.

kwargs = self._input_kwargs
self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol=None, outputCol=None, modelName=None):
def setParams(self, inputCol=None, outputCol=None, modelName=None, scaleHint="SCALE_AREA_AVERAGING"):
"""
setParams(self, inputCol=None, outputCol=None, modelName=None)
Collaborator:

add scaleHint="SCALE_AREA_AVERAGING"

@sueann sueann merged commit 12b2697 into databricks:master Jan 23, 2018