
Make python DeepImageFeaturizer use Scala version. #88

Merged
merged 13 commits into databricks:master from tomas/ML_3150 on Jan 23, 2018

Conversation

tomasatdatabricks (author):

  • Based on the Image schema PR; do not merge until the Image schema PR is merged.
  • Otherwise mostly straightforward, except that results will not match Keras in general due to different image libraries.

@sueann (Collaborator) left a comment:

Here is an initial pass. In the future, could you fix up the obvious style issues before requesting review so we don't spend as much time on them (if it'll help, we can invest in putting a linter into this repo), and pay a bit more attention to readability? Thanks!

Also let's add @MrBago as a reviewer since he is the most familiar with these parts.

@@ -117,12 +117,10 @@ def getInputTensor(self):
def getOutputTensor(self):
tensor_name = self.getOrDefault(self.outputTensor)
return self.getGraph().get_tensor_by_name(tensor_name)

Collaborator:

Can we keep these newlines between functions?

Collaborator:

In fact, can you remove the changes in this file, since they are not needed and irrelevant to the PR? (Sorry if I'm seeing weird changes because I'm using GitHub wrong.)

@@ -235,3 +216,45 @@ def _buildTFGraphForName(name, featurize):
modelData["graph"] = graph

return modelData

class PyDeepImageFeaturizer(Transformer, HasInputCol, HasOutputCol):
Collaborator:

Is this for testing? If so, can you comment and make it really clear it's not for normal usage?

Collaborator (author):

This is still WIP; I have not decided what to do with the Python version of the featurizer. I can mark it as test-only.

@@ -182,7 +182,7 @@ def test_featurization(self):
Tests that featurizer returns (almost) the same values as Keras.
"""
output_col = "prediction"
transformer = DeepImageFeaturizer(inputCol="image", outputCol=output_col,
transformer = PyDeepImageFeaturizer(inputCol="image", outputCol=output_col,
Collaborator:

This doesn't test what we need to anymore. Let's discuss offline.

@@ -215,3 +215,24 @@ def test_featurizer_in_pipeline(self):
pred_df_collected = lrModel.transform(train_df).collect()
for row in pred_df_collected:
self.assertEqual(int(row.prediction), row.label)

def test_scala_vs_py(self):
Collaborator:

Can you make this function more readable by breaking down long lines and grouping lines into sections? I will review it after 😂

@@ -0,0 +1,105 @@

Collaborator:

add license info here (see other files)

Contributor:

+1

Collaborator:

we still need the top-level license for the file (in addition to the model-specific ones you have already put in below).

from hashlib import sha256
from base64 import b64encode

def gen_model(name,model, model_file, version=1,featurize=True):
Collaborator:

space after 1,

Collaborator:

can you fix the spacing everywhere in this file

return '/Users/tomas/dev/spark-deep-learning/python/tests/resources/images'


def test_scala_vs_py():
Collaborator:

please make this also more readable

Collaborator:

Is this the same test as what's in the Python portion? How/when is this meant to be called?

"""


from hashlib import sha256
Collaborator:

put all imports at the top of file

model_file.write(scala_template %{"name":name,"height":model.inputShape()[0],"width":model.inputShape()[1],"version":version,"base64":base64_hash})
return g2

import os
Collaborator:

import up top

@@ -38,6 +38,7 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def

final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
final val scaleFast: Param[Boolean] = new Param[Boolean](this,"scaleFast","use fast resizing if set.")
Collaborator:

what does "fast resizing" mean?

Collaborator (author):

It's just a flag signaling to the resizer to use the other resize method. Not sure how to name it; it's not mapped to a single flag on the Java side either. SCALE_FAST or SCALE_DEFAULT would work, probably a few others as well. The name is WIP.

@tomasatdatabricks tomasatdatabricks force-pushed the tomas/ML_3150 branch 5 times, most recently from d2c8562 to 24f3290 on December 19, 2017 19:47
codecov-io commented Dec 19, 2017

Codecov Report

Merging #88 into master will decrease coverage by 0.04%.
The diff coverage is 86.58%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #88      +/-   ##
==========================================
- Coverage   82.49%   82.44%   -0.05%     
==========================================
  Files          33       34       +1     
  Lines        1879     1937      +58     
  Branches       35       36       +1     
==========================================
+ Hits         1550     1597      +47     
- Misses        329      340      +11
Impacted Files Coverage Δ
python/sparkdl/transformers/tf_image.py 94.11% <ø> (ø) ⬆️
...main/scala/com/databricks/sparkdl/ImageUtils.scala 90.9% <100%> (ø) ⬆️
src/main/scala/com/databricks/sparkdl/Models.scala 85% <85%> (ø)
...a/com/databricks/sparkdl/DeepImageFeaturizer.scala 93.65% <86.66%> (-1.74%) ⬇️
python/sparkdl/transformers/named_image.py 91.4% <88.46%> (-2.12%) ⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@MrBago (Contributor) left a comment:

I have two high-level concerns about this PR.

First, the auto-generation of the Models.scala file seems brittle, and that code doesn't have any unit tests. It seems very likely that we'll need to change it the next time someone needs to run it. For example, if we need to update a single model, the script will roll all the version numbers, so we will either need to upload a bunch of unchanged models or manually mess with the version numbers. My suggestion is that we isolate this code, put it somewhere outside the python packages (executable scripts shouldn't go in the package anyway), and formalize the process for adding new models in a follow-up task.

Second, the testing in DLP is a total beast and we're making it worse. This is a pretty small package and our CI tests take something like 4 hours, is that right? I can't get the tests to pass on my local machine, and I've been trying. This PR now requires downloading all the models in order to run the tests. I think we should think about how we can test the correctness of the Python code that runs the TensorFlow featurizers without running every single TensorFlow featurizer.

We should also test the graph files we export, but the hashes will ensure those don't change, so we don't need to test that every time we run the test suite.
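For context, a minimal sketch of the hash mechanism being referenced. This is an inference from the sha256/b64encode imports in the generator script shown below, so the helper name and exact wiring are assumptions:

from base64 import b64encode
from hashlib import sha256

def graph_hash(graph_def_bytes):
    # Base64-encoded sha256 of the serialized TF graph. Embedding this value
    # in Models.scala means a silently modified .pb file fails at load time.
    return b64encode(sha256(graph_def_bytes).digest())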

to the image column in DataFrame. The output is a MLlib Vector so that DeepImageFeaturizer
can be used in a MLlib Pipeline.
The input image column should be 3-channel SpImage.
"""
Contributor:

Why are we removing the doc string? I think all public classes should have docstrings.

scalaFeaturizer.setModelName(self.getOrDefault(self.modelName))
scalaFeaturizer.setInputCol(self.getOrDefault(self.inputCol))
scalaFeaturizer.setOutputCol(self.getOrDefault(self.outputCol))
if(self.isDefined(self.scaleHint)):
Contributor:

We don't usually put () around the if condition in Python, e.g.:

if condition:
    pass

scalaFeaturizer.setInputCol(self.getOrDefault(self.inputCol))
scalaFeaturizer.setOutputCol(self.getOrDefault(self.outputCol))
if(self.isDefined(self.scaleHint)):
scalaFeaturizer.setResizeFlag(self.getOrDefault(self.scaleHint))
Contributor:

Why is scaleHint treated differently than the other params?

Collaborator (author):

scaleHint is optional; (input/output)Col are not.
The isDefined call is so that we don't override Scala's default.
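In other words, a minimal sketch of the guard being described (using the setScaleHint name settled on later in this review):

scalaFeaturizer.setInputCol(self.getOrDefault(self.inputCol))
scalaFeaturizer.setOutputCol(self.getOrDefault(self.outputCol))
# Transfer scaleHint only when the user actually set it, so the
# Scala-side default is left untouched otherwise.
if self.isDefined(self.scaleHint):
    scalaFeaturizer.setScaleHint(self.getOrDefault(self.scaleHint))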

return self._set(modelName=value)

def getModelName(self):
return self.getOrDefault(self.modelName)

def _transform(self, dataset):
Contributor:

It would be nice if we inherited from JavaTransformer and used the default _transform implementation here. Was that considered?

Collaborator (author):

I was not aware of JavaTransformer. I'll look into it.
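For reference, a rough sketch of what the JavaTransformer route could look like. Assumptions: the Scala class is com.databricks.sparkdl.DeepImageFeaturizer (as in this PR), and the modelName/scaleHint Params are omitted for brevity:

from pyspark import keyword_only
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.wrapper import JavaTransformer

class DeepImageFeaturizer(JavaTransformer, HasInputCol, HasOutputCol):
    # JavaTransformer's default _transform transfers the Python-side params
    # to the wrapped JVM object and calls its transform(), so no custom
    # _transform implementation is needed.
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(DeepImageFeaturizer, self).__init__()
        self._java_obj = self._new_java_obj(
            "com.databricks.sparkdl.DeepImageFeaturizer", self.uid)
        kwargs = self._input_kwargs
        self._set(**kwargs)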

outTensor = tf.to_double(tf.reshape(m.output, [-1]), name="%s_sparkdl_output__" % name)
gdef = tfx.strip_and_freeze_until([outTensor], session.graph, session, False)
g2 = tf.Graph()
with g2.as_default():
Contributor:

Do we really need nested `with graph.as_default():` blocks? The semantics of this aren't clear to me.

Collaborator (author):

Yes, we do need a new graph into which we read the definition; otherwise it imports into the existing graph.
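A minimal illustration of the point, using the TF 1.x API (gdef stands in for the frozen GraphDef produced earlier):

import tensorflow as tf

g2 = tf.Graph()
with g2.as_default():
    # Imports gdef into the fresh graph g2. Without this block, the nodes
    # would be merged into whatever graph is currently the default.
    tf.import_graph_def(gdef, name='')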

@@ -42,7 +42,7 @@ class PythonUnitTestCase(unittest.TestCase):
class TestSparkContext(object):
@classmethod
def setup_env(cls):
cls.sc = SparkContext('local[*]', cls.__name__)
cls.sc = SparkContext.getOrCreate()
Contributor:

Why this change? I think TestSparkContext is expected to create a new context for each test suite.

Collaborator (author):

No, I checked with Philip, who is the author of that line, and it is fine this way.
The reason is that I added a test which runs prior to this one and which creates a Spark context, in which case this test fails with too many Spark contexts.
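A sketch of the behavior difference (standard PySpark API):

from pyspark import SparkContext

# SparkContext('local[*]', name) always constructs a new context and raises
# an error if one is already running in the JVM; getOrCreate() reuses the
# running context, so test suites sharing a process don't collide.
sc = SparkContext.getOrCreate()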

Contributor:

I think right now we run each individual test file separately (https://github.com/databricks/spark-deep-learning/blob/master/python/run-tests.sh#L100-L109) in our Travis CI.
(Correct me if I am wrong.) I think creating new ones or not shouldn't matter.

Collaborator (author):

It matters when you run the tests locally

Contributor:

^ I agree. When testing locally and running multiple tests simultaneously, there will be conflicts. What I was trying to say is that I support the way you are changing the Spark context.

Contributor:

Hmm ... I don't think you can define the params in the constructor (I'd like to know if you can) because I think the HasParams class uses introspection on the class to discover all the params. We might be able to validate the param values using a function that caches the Java content the first time it's run.

Contributor:

Wait ... does the validator on the Scala Param not get run until transform is called? I might look into fixing that after this release.

Collaborator (author):

The params are not transferred to scala until you call _transform. At least that's how it looks to me after a quick look. I can transfer them eagerly in setParams.

Collaborator (author):

Oh, both work. I can either:

  1. leave validation to Scala and eagerly transfer params, or
  2. add a converter with lazy initialization.

Contributor:

I think within MLlib lazy initialization is mostly used; I'm not sure why that's the case.
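A sketch of option 2, the lazily initialized converter. The _getScaleHints helper is hypothetical, standing in for the scaleHintsJava() lookup that appears later in this PR:

class _LazyScaleHintConverter(object):
    """Validates a scaleHint value against the JVM-side list of hints,
    fetched on first use so no JVM is required at import time."""
    _hints = None

    def __call__(self, value):
        if self._hints is None:
            # Hypothetical helper, e.g. dict(featurizer.scaleHintsJava()).keys()
            self._hints = set(_getScaleHints())
        if value not in self._hints:
            raise TypeError("Invalid scale hint: %r" % value)
        return value

_scaleHintConverter = _LazyScaleHintConverter()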

imageDf = imageDf.coalesce(self.numPartitionsOverride)

transformer = DeepImageFeaturizer(inputCol='image', modelName=self.name,
outputCol="features")
Contributor:

alignment

@@ -147,7 +162,7 @@ object DeepImageFeaturizer extends DefaultParamsReadable[DeepImageFeaturizer] {

// TODO: support batched graphs with mapBlocks

private[sparkdl] trait NamedImageModel {
protected[sparkdl] trait NamedImageModel {
Contributor:

Why this change? Do we expect this to be changed in subclasses?

Collaborator (author):

I needed to access it from the Models.scala file

Contributor:

I think private[sparkdl] means you can access it from within the sparkdl package.

Collaborator (author):

True, good catch

@@ -95,14 +104,20 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def
this
}

def setResizeFlag(value: String): this.type = {
set(scaleHint, value)
Contributor:

I think we always use the naming convention setX for param X. Could this be renamed to setScaleHint, or could we rename the param to resizeFlag?

Collaborator (author):

Yes, I've renamed resizeFlag to scaleHint and forgot to change the method. Good catch

@@ -113,13 +115,15 @@ private[sparkdl] object ImageUtils {
* @param tgtChannels number of channels of output image (must be 3), may be used later to
* support more channels.
* @param spImage image to resize.
* @param scaleHint hint which algorhitm to use, see java.awt.Image#SCALE_DEFAULT
Contributor:

This suggests that our default is SCALE_DEFAULT, but I don't think it is.

Contributor:

btw, I think SCALE_DEFAULT might be implementation specific for java.awt.Image subclasses.

Collaborator (author):

Hmm, I picked one at random; it is not supposed to suggest what the default value is. I can replace it with the default.

tomasatdatabricks (author):

As for the high-level comments:

  1. Sure, you might need to change the script. The aim here is not to make a script which will work without change forever; it's the script which was used for generating the models/graphs for this version, that's it. If you make changes, you might need to change the script. Leaving it in the same repo is mostly for convenience. I really don't see a problem here. I can add comments clarifying this.

  2. Regarding the tests, I agree the tests run for too long. But are you saying this PR adds significantly to the test runtime? The tests seem to be running for pretty much the same time. I think this has been an issue before, so a test overhaul belongs in a separate PR.



def gen_model(name, model, model_file, version=1, featurize=True):
g = tf.Graph()
Collaborator:

Simpler to read, and it's clear that there is no scope overlap:

g = tf.Graph()
with tf.Session(graph=g) as sess:
  ...
  gdef = tfx.strip_and_freeze_until([outTensor], sess.graph, sess, return_graph=False)
g2 = tf.Graph()
with tf.Session(graph=g2) as sess:    # I believe `with g2.as_default():` instead of initializing a session would also work here, but I'd do it outside of the previous session for clarity.
  tf.import_graph_def(gdef, name='')
  filename = "sparkdl-%s_v%d.pb" % (name, version)
  print 'writing out ', filename
  tf.train.write_graph(g2.as_graph_def(), logdir="./", name=filename, as_text=False)
# I'd return here and deal with the scala file writing elsewhere, but if you want to put anything more, you can do it outside the second session.

Collaborator (author):

Yeah that's way cleaner, thanks.


MrBago commented Dec 20, 2017

One more thing: I know that we've already talked about style, but I don't know if the linter will catch this. We should be consistent about using camelCase as much as possible. And also let's try to be consistent with indentation:

scala(
  lines,
  wrap,
  likeThis)

python(wraps, like,
       this)

weShouldNot(use,
            this,
            hybrid,
            style)

(We had a conversation about adopting a single indentation style for the whole project back in June, and while I think there were some good arguments for that, we decided it would be best to stick to the predominant style in each of the two languages.)

tomasatdatabricks (author):

Ok, I've pushed an updated version.

Biggest change: I've changed gen_app_models.py to no longer modify Models.scala directly.
It now generates a Models.scala.generated file in the working directory and lets the user handle the rest.

@smurching smurching self-requested a review December 21, 2017 01:04
@smurching (Collaborator) left a comment:

Nice work, left a few comments

kerasPredict = self.kerasPredict

def rowWithImage(img):
# return [imageIO.imageArrayToStruct(img.astype('uint8'), imageType.sparkMode)]
Collaborator:

Remove this line?

def rowWithImage(img):
# return [imageIO.imageArrayToStruct(img.astype('uint8'), imageType.sparkMode)]
row = imageIO.imageArrayToStruct(img.astype('uint8'))
# re-order row to avoid pyspark bug
Collaborator:

Optional: if there's a JIRA ticket for the bug, note the ticket name (e.g. "SPARK-xxxx") in this comment?

@@ -38,6 +41,9 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def

final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
final val scaleHint: Param[String] = new Param(this,"scaleHint","hint which method to use for resizing.",
Collaborator:

Nit: Spaces after commas 🙃

@@ -95,14 +104,20 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def
this
}

def getScaleHint(value: String): this.type = {
Collaborator:

Typo: should this be setScaleHint (not getScaleHint)?

self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol=None, outputCol=None, modelName=None):
Collaborator:

I think we still need to implement setParams along with Python getters/setters for scaleHint and modelName. You can look at PySpark's Bucketizer implementation as an example: https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L318

Collaborator:

It's a little strange that our unit tests didn't catch this; in PySpark we run doctests that attempt to get/set params, but I guess we don't have equivalent tests in DLP. It would be good to add such tests at some point, probably in a future PR... cc @MrBago

Collaborator (author):

Ah ok, I did not know this was required. I saw it somewhere in the code but I did not see a use for it. I'll add it, thanks.
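For what it's worth, a minimal sketch of the setParams pattern the Bucketizer example uses; the scaleHint default follows this PR, but treat it as a sketch rather than the final code:

from pyspark import keyword_only

@keyword_only
def setParams(self, inputCol=None, outputCol=None, modelName=None,
              scaleHint="SCALE_AREA_AVERAGING"):
    """
    setParams(self, inputCol=None, outputCol=None, modelName=None, \
              scaleHint="SCALE_AREA_AVERAGING")
    Sets the params for this DeepImageFeaturizer.
    """
    # keyword_only stashes only the explicitly passed kwargs on the instance.
    kwargs = self._input_kwargs
    return self._set(**kwargs)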

@MrBago (Contributor) left a comment:

Does test_featurization_no_reshape require us to download all the model graphs in order to run our unit tests?

That's probably ok for now, but I think we should follow up soon with a task to restructure the testing a bit. It would be really nice to get the testing time down for our primary test suite.

@@ -0,0 +1,142 @@
#!/bin/python
Contributor:

The standard python shebang is #!/usr/bin/env python

@@ -0,0 +1,142 @@
#!/bin/python

Contributor:

Can we move this file to python/scripts/ or python/model_gen?

"""

modelName = Param(Params._dummy(), "modelName", "A deep learning model name",
typeConverter=SparkDLTypeConverters.buildSupportedItemConverter(SUPPORTED_MODELS))

scaleHint = Param(Params._dummy(), "scaleHint", "Hint which algorhitm to use for image resizing",
typeConverter=_scaleHintConverter)

Contributor:

Did you want to add getters and setters for scaleHint & modelName?

Collaborator (author):

Yeah, already added.
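For reference, the conventional getter/setter pair, mirroring the modelName accessors quoted earlier in this review:

def setScaleHint(self, value):
    return self._set(scaleHint=value)

def getScaleHint(self):
    return self.getOrDefault(self.scaleHint)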

modelName=self.getModelName(), featurize=True)
return transformer.transform(dataset)
# TODO: give an option to take off multiple layers so it can be used in tuning
# (could be the name of the layer or int for how many to take off).
Contributor:

This seems like kind of a big task for a source comment; should we maybe track this in an issue or JIRA instead?

@@ -86,7 +86,8 @@ def test_saved_graph_novar(self):

def gin_fun(session):
_build_saved_model(session, saved_model_dir)
return TFInputGraph.fromGraph(session.graph, session, [_tensor_input_name], [_tensor_output_name])
return TFInputGraph.fromGraph(session.graph, session, [
_tensor_input_name], [_tensor_output_name])
Contributor:

style: can we move [ to next line?

import org.tensorframes.impl.DebugRowOps
import org.tensorframes.{Shape, ShapeDescription}
Contributor:

Why the re-ordering of all the imports?

Collaborator (author):

Result of IDEA's Optimize Imports. I guess it puts them in alphabetical order.

Collaborator:

The original is the style we're using in the repo (following Databricks standards). You can setup your IntelliJ to use our convention: https://github.com/databricks/scala-style-guide#imports


import java.awt.image.BufferedImage
import java.awt.{Color, Image}

import com.sun.javafx.iio.ImageStorage.ImageType
import org.apache.spark.ml.image.ImageSchema
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.UserDefinedFunction
Contributor:

Let's drop the unused imports here:

awt, sql.functions, sql.expressions and com.sun.

def transform(dataFrame: Dataset[_]): DataFrame = {
validateSchema(dataFrame.schema)
val model = DeepImageFeaturizer.supportedModelMap(getModelName)

val imSchema = ImageSchema.columnSchema
val height = model.height
val width = model.width
val resizeUdf = udf((image: Row) => ImageUtils.resizeImage(height, width, 3, image), imSchema)

val resizeUdf = udf((image: Row) => ImageUtils.resizeImage(height, width, 3, image, DeepImageFeaturizer.scaleHints(getScaleHint)), imSchema)
Contributor:

length

kerasReshaped[i],
features_sc[i]) for i in range(
len(features_sc))]
np.testing.assert_array_almost_equal([0 for i in range(len(features_sc))], diffs, decimal=2)
Contributor:

Numpy lets you compare scalars to arrays:

np.testing.assert_array_almost_equal(diffs, 0., decimal=2)

@tomasatdatabricks tomasatdatabricks changed the title [WIP] Make python DeepImageFeaturizer use Scala version. Make python DeepImageFeaturizer use Scala version. Jan 4, 2018
@sueann (Collaborator) left a comment:

Looks generally good to me. Some changes requested (they shouldn't require much logic change).

Summarizing the tests for my sake, we have (please correct me if I'm wrong):

  • In Python, for all application models:
    • without resizing, Keras <-> DeepImageFeaturizer equivalence
    • with resizing, Keras <-> DeepImageFeaturizer proximity via cosine distance (+ a one-time test via transfer learning with DeepImageFeaturizer features)
  • In Scala, unit tests for DeepImageFeaturizer on a simple TF graph.

The Python tests are essentially integration tests for all the models in DeepImageFeaturizer.

#
# Takes keras models in sparkdl.transformers.keras_applications and prepends reshaping from ImageSchema
# and model specific preprocessing.
# Produces tensor flow model files and a scala file containing scala wrappers for all the models.
Collaborator:

nit: tensor flow -> TensorFlow

# 1. model *.pb files (need to be uploaded to S3) .
# 2. generated scala model wrappers Models.scala.generated (needs to be moved over to appropriate scala folder)
#
from base64 import b64encode
Collaborator:

insert space before

# and model specific preprocessing.
# Produces tensor flow model files and a scala file containing scala wrappers for all the models.
#
# Input: sparkdl.transformers.keras_aplications.KERAS_APPLICATION_MODELS
Collaborator:

keras_aplications -> keras_applications

This is not actually input by the user, right? It's read automatically by the script. Let's make that clear.

@@ -16,8 +16,13 @@
from keras.applications.imagenet_utils import decode_predictions
import numpy as np


Collaborator:

remove newline

return dict(featurizer.scaleHintsJava()).keys()


class _LazyCaleHintConverter:
Collaborator:

_LazyScaleHintConverter (typo) ?


'',
' private[sparkdl] object TestNet extends NamedImageModel {',
' /**',
' * A simple test graph used for testing DeepImageFeaturizer',
Collaborator:

indentation

}

/**
* Model provided by Keras. All cotributions by Keras are provided subject to the
Collaborator:

indent

Collaborator (author):

About the import reordering (GitHub does not let me respond above :/ )

Thanks, the IDEA setup helps! I think IDEA actually did the right thing in alphabetizing imports from third-party libraries. The only violation of the rules is the scala.* reference at the bottom.

}

/**
* Model provided by Keras. All cotributions by Keras are provided subject to the
Collaborator:

indent

Due to the differences between images resized by different libraries, DeepImageFeaturizer is no longer required to match the results from Keras on raw (non-resized) images.
Instead of computing the l-inf or l2 norm of the two feature vectors, we compare their cosine distance and require it to be sufficiently "low" (< 1e-2).
We also ran several transfer learning examples and ensured that the results were comparable.
These experiments were successful and the new sparkdl features proved to be at least as good as the native Keras ones; however, they have not been added as automated tests.

Overall I think the combination of (1) an exact match with no resize and (2) not too different with resize is good enough.

Cosine distance justification:
Cosine distance ensures that the resulting feature vector points in a similar direction. Intuitively, this is an important property for the generated features, and I think requiring the cosine distance to be low enough gives better guarantees than computing the l2 or l-inf norm and comparing it against a huge allowed diff.
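A sketch of the comparison being described, using scipy (which the tests in this PR import); keras_features and sparkdl_features are illustrative placeholders for the two feature vectors:

from scipy.spatial.distance import cosine

# cosine(u, v) = 1 - u.v / (||u|| * ||v||): 0 for identical directions,
# approaching 1 for unrelated ones.
dist = cosine(keras_features, sparkdl_features)
assert dist < 1e-2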

I did comparisons against different images, various amounts of added noise, and some obvious bugs I could think of, such as skipping the preprocessing or having the color channels flipped. Most distances came out orders of magnitude higher; noise with sd = 0.01 got a comparable distance. Here's the breakdown on the test images (the distance metric is by definition in the [0, 1] interval):

cosine distance per image to the same image with added (normal, mean = 0) noise:
sd = 1.00: [0.69, 0.78, 0.77, 0.75, 0.76]
sd = 0.10: [0.1, 0.2, 0.31, 0.12, 0.23]
sd = 0.01: [0.0078, 0.0094, 0.060, 0.0040, 0.0085]

cosine distance with no preprocessing: all ~ 0.9
cosine distance with faulty preprocess (mean of one channel is incorrect): all ~ 0.1
cosine distance with flipped channels: all ~ 0.3
cosine distance matrix for the test images:
[ 0.00 0.75 0.71 0.70 0.67]
[ 0.75 0.00 0.79 0.85 0.74]
[ 0.71 0.80 0.00 0.75 0.69]
[ 0.70 0.85 0.75 0.00 0.70]
[ 0.67 0.74 0.69 0.70 0.00]
…p_models.py script no longer modifies the Models.scala source directly. Instead it generates a file in the current working directory and lets the user copy it.
  Few minor fixes; the scaleHint converter in DeepImageFeaturizer is now lazy (and params are eagerly transferred to the JVM).
  Added licenses for the generated named model wrappers.
@sueann (Collaborator) left a comment:

just small comments. lgtm otherwise. thanks!

"name": name,
"height": model.inputShape()[0],
"width": model.inputShape()[1],
"version": version,
"base64": base64_hash})
"base64": base64_hash},2))
Collaborator:

space after ,

Collaborator:

i think this might need to be 1 instead of 2

Collaborator (author):

I think 2 is correct; it's the number of spaces for the indent, not the number of indent levels.

Collaborator:

Mm, not sure what's going on, but the generated file's indent doesn't look right:

/**
  * What's in Models.scala
  */
/**
 * Expected
 */

g = gen_model(name=name, model=modelConstructor(), model_file=f)
if not name in licenses:
raise KeyError("Missing license for model '%s'" % name )
g = gen_model(license = licenses[name],name=name, model=modelConstructor(), model_file=f)
Collaborator:

space after ,

"""
setParams(self, inputCol=None, outputCol=None, modelName=None, decodePredictions=False,
topK=5)
scaleHint="SCALE_AREA_AVERAGING",topK=5)
Collaborator:

space after ,

"""


def gen_model(name, model, model_file, version=1, featurize=True):
def indent(s, lvl):
return '\n'.join([' '*lvl + x for x in s.split('\n')])
Collaborator:

space around *


sueann commented Jan 18, 2018

Oh, and we need to re-run Travis.

"""
setParams(self, inputCol=None, outputCol=None, modelName=None)
setParams(self, inputCol=None, outputCol=None, modelName=None, decodePredictions=False,
Collaborator:

remove decodePredictions & topK (not sure if we talked about this; the comment here goes into the docs).

Collaborator (author):

oh yeah, that's a good catch, thanks

@@ -38,6 +39,9 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def

final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
final val scaleHint: Param[String] = new Param(this,"scaleHint", "hint which method to use for resizing.",
Collaborator:

space after this,

"name": name,
"height": model.inputShape()[0],
"width": model.inputShape()[1],
"version": version,
"base64": base64_hash})
"base64": base64_hash},2))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mm not sure what's going on but the generated file's indent doesn't look right:

/**
  * What's in Models.scala
  */
/**
 * Expected
 */

@@ -16,6 +16,8 @@
import numpy as np
import os

from scipy import spatial
Collaborator:

what's going on with the import ordering & grouping here 😂 (lines 16-25)

@sueann (Collaborator) left a comment:

just one comment fix.

kwargs = self._input_kwargs
self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol=None, outputCol=None, modelName=None):
def setParams(self, inputCol=None, outputCol=None, modelName=None, scaleHint="SCALE_AREA_AVERAGING"):
"""
setParams(self, inputCol=None, outputCol=None, modelName=None)
Collaborator:

add scaleHint="SCALE_AREA_AVERAGING"

@sueann sueann merged commit 12b2697 into databricks:master Jan 23, 2018