Make python DeepImageFeaturizer use Scala version. #88
Conversation
tomasatdatabricks
commented
Dec 16, 2017
- Based off the Image schema PR; do not merge until the Image schema PR is merged.
- Otherwise mostly straightforward, except that results will not match Keras in general due to different image libraries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an initial pass. In the future, could you fix up the obvious style issues before requesting review so we don't spend as much time on them (if it'll help, we can invest in putting in a linter for this repo), and pay a bit more attention to readability? Thanks!
Also, let's add @MrBago as a reviewer since he is the most familiar with these parts.
@@ -117,12 +117,10 @@ def getInputTensor(self):
    def getOutputTensor(self):
        tensor_name = self.getOrDefault(self.outputTensor)
        return self.getGraph().get_tensor_by_name(tensor_name)
Can we keep these newlines between functions?
In fact, can you remove the changes in this file since they are not needed and irrelevant to the PR (sorry if I'm seeing weird changes because I'm using github wrong)
@@ -235,3 +216,45 @@ def _buildTFGraphForName(name, featurize):
    modelData["graph"] = graph
    return modelData


class PyDeepImageFeaturizer(Transformer, HasInputCol, HasOutputCol):
Is this for testing? If so, can you comment and make it really clear it's not for normal usage?
This is still WIP; I have not decided what to do with the Py version of the featurizer. I can mark it as test-only.
@@ -182,7 +182,7 @@ def test_featurization(self):
        Tests that featurizer returns (almost) the same values as Keras.
        """
        output_col = "prediction"
-       transformer = DeepImageFeaturizer(inputCol="image", outputCol=output_col,
+       transformer = PyDeepImageFeaturizer(inputCol="image", outputCol=output_col,
This doesn't test what we need to anymore. Let's discuss offline.
@@ -215,3 +215,24 @@ def test_featurizer_in_pipeline(self):
        pred_df_collected = lrModel.transform(train_df).collect()
        for row in pred_df_collected:
            self.assertEqual(int(row.prediction), row.label)


    def test_scala_vs_py(self):
Can you make this function more readable by breaking down long lines and grouping lines into sections? I will review it after 😂
@@ -0,0 +1,105 @@
add license info here (see other files)
+1
we still need the top-level license for the file (in addition to the model-specific ones you have already put in below).
from hashlib import sha256
from base64 import b64encode


def gen_model(name,model, model_file, version=1,featurize=True):
space after 1,
can you fix the spacing everywhere in this file
    return '/Users/tomas/dev/spark-deep-learning/python/tests/resources/images'


def test_scala_vs_py():
please make this also more readable
Is this the same test as what's in the python portion? how/when is this meant to be called?
    """


from hashlib import sha256
put all imports at the top of file
    model_file.write(scala_template %{"name":name,"height":model.inputShape()[0],"width":model.inputShape()[1],"version":version,"base64":base64_hash})
    return g2

import os
import up top
@@ -38,6 +38,7 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def

  final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
  final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
+ final val scaleFast: Param[Boolean] = new Param[Boolean](this,"scaleFast","use fast resizing if set.")
what does "fast resizing" mean?
It's just a flag signaling to the resizer to use the other resize method. Not sure how to name it; it's not mapped to a single flag on the Java side either. SCALE_FAST or SCALE_DEFAULT would work, probably a few others as well. The name is WIP.
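To make the discussion above concrete: java.awt.Image defines a small set of integer scaling-hint constants, and the string hints discussed here would presumably map onto them. Below is a hypothetical sketch (not code from the PR) of such a mapping; the string key names are assumptions, but the integer values are the documented java.awt.Image constants.

```python
# Hypothetical scaleHint -> java.awt.Image flag mapping. The integer
# values are the documented AWT constants; the string keys are assumed.
AWT_SCALE_HINTS = {
    "SCALE_DEFAULT": 1,          # Image.SCALE_DEFAULT: implementation's choice
    "SCALE_FAST": 2,             # Image.SCALE_FAST: prioritize speed
    "SCALE_SMOOTH": 4,           # Image.SCALE_SMOOTH: prioritize quality
    "SCALE_REPLICATE": 8,        # Image.SCALE_REPLICATE: pixel replication
    "SCALE_AREA_AVERAGING": 16,  # Image.SCALE_AREA_AVERAGING
}

def resolve_scale_hint(name):
    """Look up the AWT flag for a string hint, rejecting unknown names."""
    try:
        return AWT_SCALE_HINTS[name]
    except KeyError:
        raise ValueError("unsupported scaleHint: %r" % name)
```

A validated lookup like this keeps the public param a readable string while the JVM side deals only in the raw AWT flag.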
Force-pushed from d2c8562 to 24f3290.
Codecov Report
@@ Coverage Diff @@
## master #88 +/- ##
==========================================
- Coverage 82.49% 82.44% -0.05%
==========================================
Files 33 34 +1
Lines 1879 1937 +58
Branches 35 36 +1
==========================================
+ Hits 1550 1597 +47
- Misses 329 340 +11
Continue to review full report at Codecov.
Force-pushed from 8b2d9ba to 59673d0.
I have two high level concerns about this PR.
First, the auto-generation of the Models.scala file seems fickle, and that code doesn't have any unit tests. It seems very likely that we'll need to change it the next time someone needs to run it. For example, if we need to update a single model, the script will roll all the version numbers, so we will either need to upload a bunch of unchanged models or manually mess with model numbers. My suggestion is that we isolate this code and put it somewhere outside the python packages (executable scripts shouldn't go in the package anyways) and formalize the process for adding new models in a follow-up task.
Second, the testing in DLP is a total beast and we're making it worse. This is a pretty small package and our CI tests take like 4 hours, is that right? I can't get the tests to pass on my local machine and I've been trying. This PR now requires all the tests to download all the models in order to run the tests. I think we should think about how we can test the correctness of the python code that runs the tensorflow featurizers without running every single tensorflow featurizer.
We should also test the graph files we export, but the hashes will ensure those don't change so we don't need to test that every time we run the test suite.
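The hash-pinning idea in the last paragraph can be sketched as follows (a minimal illustration, not code from the PR; the graph bytes and helper names here are made up):

```python
from hashlib import sha256

def digest(data):
    """sha256 hex digest of an exported graph file's bytes."""
    return sha256(data).hexdigest()

# Recorded once at export time (placeholder bytes, purely illustrative).
exported_bytes = b"frozen-graph-bytes"
EXPECTED_DIGEST = digest(exported_bytes)

def check_graph_unchanged(data, expected):
    """Cheap regression check: verify bytes instead of re-running the model."""
    return digest(data) == expected
```

With a pinned digest per exported graph, the regular test suite only needs a byte comparison; the expensive Keras-vs-featurizer equivalence run happens once, at export time.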
    to the image column in DataFrame. The output is a MLlib Vector so that DeepImageFeaturizer
    can be used in a MLlib Pipeline.
    The input image column should be 3-channel SpImage.
    """
Why are we removing the doc string? I think all public classes should have docstrings.
        scalaFeaturizer.setModelName(self.getOrDefault(self.modelName))
        scalaFeaturizer.setInputCol(self.getOrDefault(self.inputCol))
        scalaFeaturizer.setOutputCol(self.getOrDefault(self.outputCol))
        if(self.isDefined(self.scaleHint)):
We don't usually put () around the if condition in Python, e.g.:
if condition:
pass
        scalaFeaturizer.setInputCol(self.getOrDefault(self.inputCol))
        scalaFeaturizer.setOutputCol(self.getOrDefault(self.outputCol))
        if(self.isDefined(self.scaleHint)):
            scalaFeaturizer.setResizeFlag(self.getOrDefault(self.scaleHint))
Why is scaleHint treated differently than the other params?
scaleHint is optional; (input/output)Col are not.
The isDefined call is so that we don't override Scala's default.
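To illustrate why that guard matters, here is a minimal sketch (all names invented, no JVM involved): pushing an unset optional param across would clobber the Scala-side default, while the required columns are always transferred.

```python
class ScalaFeaturizerStub:
    """Stands in for the JVM-side featurizer, which owns the default."""
    def __init__(self):
        self.scale_hint = "SCALE_AREA_AVERAGING"  # assumed Scala-side default

    def setResizeFlag(self, value):
        self.scale_hint = value

def transfer_params(jvm_obj, py_params):
    # Analogous to `if self.isDefined(self.scaleHint): ...` — only push the
    # hint across when the user explicitly set it on the Python side.
    if "scaleHint" in py_params:
        jvm_obj.setResizeFlag(py_params["scaleHint"])
    return jvm_obj

untouched = transfer_params(ScalaFeaturizerStub(), {})
explicit = transfer_params(ScalaFeaturizerStub(), {"scaleHint": "SCALE_FAST"})
```

Without the guard, the first call would overwrite the stub's default with `None` (or whatever the unset Python param holds); with it, the Scala default survives.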
        return self._set(modelName=value)

    def getModelName(self):
        return self.getOrDefault(self.modelName)

    def _transform(self, dataset):
It would be nice if we inherited from JavaTransformer and used the default _transform implementation here. Was that considered?
I was not aware of JavaTransformer. I'll look into it.
    outTensor = tf.to_double(tf.reshape(m.output, [-1]), name="%s_sparkdl_output__" % name)
    gdef = tfx.strip_and_freeze_until([outTensor], session.graph, session, False)
    g2 = tf.Graph()
    with g2.as_default():
Do we really need nested `with graph.as_default():` blocks? The semantics of this isn't clear to me.
Yes we do need a new graph into which we read the definition. Otherwise it imports it into the existing graph.
python/tests/tests.py
Outdated
@@ -42,7 +42,7 @@ class PythonUnitTestCase(unittest.TestCase):
class TestSparkContext(object):
    @classmethod
    def setup_env(cls):
-       cls.sc = SparkContext('local[*]', cls.__name__)
+       cls.sc = SparkContext.getOrCreate()
Why this change? I think TestSparkContext is expected to create a new context for each test suite.
Nope, I checked with Philip, who is the author of that line, and it is fine this way.
The reason is that I added a test which runs prior to this one and which creates a Spark context, in which case this test fails with too many Spark contexts.
I think right now we run each individual test file separately https://github.com/databricks/spark-deep-learning/blob/master/python/run-tests.sh#L100-L109 in our Travis CI.
(Correct me if I am wrong) I think creating new ones or not shouldn't matter.
It matters when you run the tests locally
^ I agree. When testing locally and running multiple tests simultaneously, there will be a conflict. What I was trying to say is that I support the way you are changing the Spark context.
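The "too many contexts" failure mode discussed here can be illustrated with a toy singleton (a sketch, not SparkContext itself): a process allows only one active context, so `getOrCreate` reuses the existing one instead of constructing a second and crashing.

```python
class ToyContext:
    """Toy stand-in for a process-wide context like SparkContext."""
    _active = None

    def __init__(self, name):
        if ToyContext._active is not None:
            # Mirrors Spark's "Cannot run multiple SparkContexts at once" error.
            raise RuntimeError("cannot run multiple contexts at once")
        self.name = name
        ToyContext._active = self

    @classmethod
    def getOrCreate(cls, name="default"):
        # Reuse the existing context if one is already active.
        return cls._active if cls._active is not None else cls(name)

a = ToyContext.getOrCreate("suiteA")
b = ToyContext.getOrCreate("suiteB")  # reuses a instead of raising
```

This is why the change is safe when each test file runs in its own process (CI) and helpful when several suites share one process (local runs).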
hmm ... I don't think you can define the params in the constructor (I'd like to know if you can) because I think the HasParams class uses inspection on the class to discover all the params. We might be able to validate the param values using a function that caches the java content the first time it's run.
Wait ... does the validator on a Scala Param not get run until transform is called? I might look into fixing that after this release.
The params are not transferred to scala until you call _transform. At least that's how it looks to me after a quick look. I can transfer them eagerly in setParams.
Oh, both work. I can either
- leave validation to Scala and eagerly transfer params, or
- add a converter with lazy initialization.
I think within MLlib lazy initialization is mostly used; I'm not sure why that's the case.
        imageDf = imageDf.coalesce(self.numPartitionsOverride)

        transformer = DeepImageFeaturizer(inputCol='image', modelName=self.name,
            outputCol="features")
alignment
@@ -147,7 +162,7 @@ object DeepImageFeaturizer extends DefaultParamsReadable[DeepImageFeaturizer] {

  // TODO: support batched graphs with mapBlocks

- private[sparkdl] trait NamedImageModel {
+ protected[sparkdl] trait NamedImageModel {
Why this change? Do we expect this to be changed in subclasses?
I needed to access it from the Models.scala file
I think private[sparkdl] means you can access it from within the sparkdl package.
True, good catch
@@ -95,14 +104,20 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def
    this
  }

  def setResizeFlag(value: String): this.type = {
    set(scaleHint, value)
I think we always use the naming convention setX for param X. Could this be renamed to setScaleHint, or could we rename the param to resizeFlag?
Yes, I've renamed resizeFlag to scaleHint and forgot to change the method. Good catch
@@ -113,13 +115,15 @@ private[sparkdl] object ImageUtils {
   * @param tgtChannels number of channels of output image (must be 3), may be used later to
   *                    support more channels.
   * @param spImage image to resize.
+  * @param scaleHint hint which algorhitm to use, see java.awt.Image#SCALE_DEFAULT
This suggests that our default is SCALE_DEFAULT, but I don't think it is.
btw, I think SCALE_DEFAULT might be implementation-specific for java.awt.Image subclasses.
hmm I randomly picked one, it is not supposed to suggest what the default value is. I can replace it with default.
As for the high level comments:

def gen_model(name, model, model_file, version=1, featurize=True):
    g = tf.Graph()
Simpler to read, and it makes clear that there is no scope overlap:
g = tf.Graph()
with tf.Session(graph=g) as sess:
...
gdef = tfx.strip_and_freeze_until([outTensor], sess.graph, sess, return_graph=False)
g2 = tf.Graph()
with tf.Session(graph=g2) as sess: # i believe `with g2.as_default():` instead of initializing a session would also work here but I'd do it outside of the previous session for clarity.
tf.import_graph_def(gdef, name='')
filename = "sparkdl-%s_v%d.pb" % (name, version)
print 'writing out ', filename
tf.train.write_graph(g2.as_graph_def(), logdir="./", name=filename, as_text=False)
# I'd return here and deal with the scala file writing elsewhere, but if you want to put anything more, you can do it outside the second session.
Yeah that's way cleaner, thanks.
One more thing: I know that we've already talked about style, but I don't know if the linter will catch this. We should be consistent about using camelCase as much as possible. And also let's try to be consistent with indentation:
(We had a conversation about adopting a single indentation style for the whole project back in June, and while I think there were some good arguments for that, we decided it would be best to stick to the predominant style in each of the two languages.)
Ok, I've pushed the updated version. Biggest change: I've changed gen_app_models.py to not modify Models.scala directly.
Nice work, left a few comments
        kerasPredict = self.kerasPredict

        def rowWithImage(img):
            # return [imageIO.imageArrayToStruct(img.astype('uint8'), imageType.sparkMode)]
Remove this line?
        def rowWithImage(img):
            # return [imageIO.imageArrayToStruct(img.astype('uint8'), imageType.sparkMode)]
            row = imageIO.imageArrayToStruct(img.astype('uint8'))
            # re-order row to avoid pyspark bug
Optional: If there's a JIRA ticket for the bug note the ticket name (e.g. "SPARK-xxxx") in this comment?
@@ -38,6 +41,9 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def

  final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
  final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
+ final val scaleHint: Param[String] = new Param(this,"scaleHint","hint which method to use for resizing.",
Nit: Spaces after commas 🙃
@@ -95,14 +104,20 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def
    this
  }

  def getScaleHint(value: String): this.type = {
Typo: should this be setScaleHint (not getScaleHint)?
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, modelName=None):
I think we still need to implement setParams along with Python getters/setters for scaleHint and modelName. You can look at PySpark's Bucketizer implementation as an example: https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L318
It's a little strange that our unit tests didn't catch this, in PySpark we run doctests that attempt to get/set params but I guess we don't have equivalent tests in DLP. Would be good to add such tests at some point, probably in a future PR...cc @MrBago
Ah ok, I did not know this was required. I saw it somewhere in the code but I did not see a use for it. I'll add it, thanks.
Force-pushed from adce33d to 4656fbc.
Does test_featurization_no_reshape require us to download all the model graphs in order to run our unit tests?
That's probably ok for now, but I think we should follow up soon with a task to restructure the testing a bit. It would be really nice to get the testing time down for our primary test suite.
python/generate_app_models.py
Outdated
@@ -0,0 +1,142 @@
#!/bin/python
The standard python shebang is #!/usr/bin/env python
python/generate_app_models.py
Outdated
@@ -0,0 +1,142 @@
#!/bin/python
Can we move this file to python/scripts/ or python/model_gen?
    """

    modelName = Param(Params._dummy(), "modelName", "A deep learning model name",
                      typeConverter=SparkDLTypeConverters.buildSupportedItemConverter(SUPPORTED_MODELS))

    scaleHint = Param(Params._dummy(), "scaleHint", "Hint which algorhitm to use for image resizing",
                      typeConverter=_scaleHintConverter)
Did you want to add getters and setters for scaleHint & modelName?
Yeah, already added.
                modelName=self.getModelName(), featurize=True)
        return transformer.transform(dataset)
    # TODO: give an option to take off multiple layers so it can be used in tuning
    # (could be the name of the layer or int for how many to take off).
This seems like kind of a big task for a source comment; should we maybe track this in an issue or JIRA instead?
python/tests/graph/test_import.py
Outdated
@@ -86,7 +86,8 @@ def test_saved_graph_novar(self):

        def gin_fun(session):
            _build_saved_model(session, saved_model_dir)
-           return TFInputGraph.fromGraph(session.graph, session, [_tensor_input_name], [_tensor_output_name])
+           return TFInputGraph.fromGraph(session.graph, session, [
+               _tensor_input_name], [_tensor_output_name])
style: can we move the `[` to the next line?
import org.tensorframes.impl.DebugRowOps
import org.tensorframes.{Shape, ShapeDescription}
Why the re-ordering of all the imports?
Result of IDEA's Optimize Imports. I guess it puts them in alphabetical order.
The original is the style we're using in the repo (following Databricks standards). You can setup your IntelliJ to use our convention: https://github.com/databricks/scala-style-guide#imports
import java.awt.image.BufferedImage
import java.awt.{Color, Image}

import com.sun.javafx.iio.ImageStorage.ImageType
import org.apache.spark.ml.image.ImageSchema
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.UserDefinedFunction
Let's drop the unused imports here: awt, sql.function, sql.expressions and com.sun.
  def transform(dataFrame: Dataset[_]): DataFrame = {
    validateSchema(dataFrame.schema)
    val model = DeepImageFeaturizer.supportedModelMap(getModelName)

    val imSchema = ImageSchema.columnSchema
    val height = model.height
    val width = model.width
-   val resizeUdf = udf((image: Row) => ImageUtils.resizeImage(height, width, 3, image), imSchema)
+   val resizeUdf = udf((image: Row) => ImageUtils.resizeImage(height, width, 3, image, DeepImageFeaturizer.scaleHints(getScaleHint)), imSchema)
Line too long.
                kerasReshaped[i],
                features_sc[i]) for i in range(
                len(features_sc))]
        np.testing.assert_array_almost_equal([0 for i in range(len(features_sc))], diffs, decimal=2)
NumPy lets you compare scalars to arrays:
np.testing.assert_array_almost_equal(diffs, 0., decimal=2)
Looks generally good to me. Some changes requested (they shouldn't require much logic change).
Summarizing the tests for my sake, we have (please correct me if I'm wrong):
- In Python, for all application models:
- without resizing, Keras <-> DeepImageFeaturizer equivalence
- with resizing, Keras <-> DeepImageFeaturizer proximity via cosine distance (+ one-time test via transfer learning with DeepImageFeaturizer features)
- In Scala, unit tests for DeepImageFeaturizer on a simple TF graph.
The Python tests are essentially integration tests for all the models in DeepImageFeaturizer.
#
# Takes keras models in sparkdl.transformers.keras_applications and prepends reshaping from ImageSchema
# and model specific preprocessing.
# Produces tensor flow model files and a scala file containing scala wrappers for all the models.
nit: tensor flow -> TensorFlow
# 1. model *.pb files (need to be uploaded to S3) .
# 2. generated scala model wrappers Models.scala.generated (needs to be moved over to appropriate scala folder)
#
from base64 import b64encode
insert space before
# and model specific preprocessing.
# Produces tensor flow model files and a scala file containing scala wrappers for all the models.
#
# Input: sparkdl.transformers.keras_aplications.KERAS_APPLICATION_MODELS
keras_aplications -> keras_applications
This is not actually input by the user, right? It's read automatically by the script. Let's make that clear.
@@ -16,8 +16,13 @@
from keras.applications.imagenet_utils import decode_predictions
import numpy as np

remove newline
    return dict(featurizer.scaleHintsJava()).keys()


class _LazyCaleHintConverter:
_LazyScaleHintConverter
(typo) ?
import org.tensorframes.impl.DebugRowOps
import org.tensorframes.{Shape, ShapeDescription}
The original is the style we're using in the repo (following Databricks standards). You can setup your IntelliJ to use our convention: https://github.com/databricks/scala-style-guide#imports
@@ -0,0 +1,105 @@
|
we still need the top-level license for the file (in addition to the model-specific ones you have already put in below).
'',
' private[sparkdl] object TestNet extends NamedImageModel {',
' /**',
' * A simple test graph used for testing DeepImageFeaturizer',
indentation
}


/**
* Model provided by Keras. All cotributions by Keras are provided subject to the
indent
About the import reordering (git does not let me respond above :/ )
Thanks, the IDEA setup helps! I think IDEA actually did the right thing in alphabetizing imports from third-party libraries. The only violation of the rules is the scala.* reference at the bottom.
Due to differences in how images are resized by different libraries, DeepImageFeaturizer is no longer required to match the results from keras on raw (non-resized) images. Instead of computing the l-inf or l2 norm of the two feature vectors, we compare their cosine distance and require it to be sufficiently "low" (< 1e-2). We also ran several transfer learning examples and ensured that the results were comparable. These experiments were successful and the new sparkdl features proved to be at least as good as the native keras ones; however, they have not been added as automated tests. Overall I think the combination of (1) an exact match with no resize and (2) not-too-different results with resize is good enough.

Cosine distance justification: cosine distance ensures that the resulting feature vector has a similar direction. Intuitively, this is an important property for the generated features, and I think ensuring the cosine distance is low enough gives better guarantees than computing the l2 or l-inf norm and comparing with a huge allowed diff. I did comparisons against different images, various amounts of added noise, and some obvious bugs I could think of, such as skipping the preprocessing or flipping the color channels. Most distances came out orders of magnitude higher; only noise with sd = 0.01 got a comparable distance.

Here's the breakdown on the test images (the distance metric is by definition in the [0, 1] interval):

Cosine distance per image to the same image with added (normal, mean = 0) noise:
- sd = 1.00: [0.69, 0.78, 0.77, 0.75, 0.76]
- sd = 0.10: [0.1, 0.2, 0.31, 0.12, 0.23]
- sd = 0.01: [0.0078, 0.0094, 0.060, 0.0040, 0.0085]

Cosine distance with no preprocessing: all ~0.9
Cosine distance with faulty preprocessing (mean of one channel is incorrect): all ~0.1
Cosine distance with flipped channels: all ~0.3

Cosine distance matrix for the test images:

[ 0.00 0.75 0.71 0.70 0.67]
[ 0.75 0.00 0.79 0.85 0.74]
[ 0.71 0.80 0.00 0.75 0.69]
[ 0.70 0.85 0.75 0.00 0.70]
[ 0.67 0.74 0.69 0.70 0.00]
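The cosine-distance comparison described above can be sketched with stdlib-only Python. The tests in this PR use scipy.spatial; the function below is an illustrative equivalent for two feature vectors given as plain lists.

```python
import math

def cosine_distance(u, v):
    # 1 - cos(angle between u and v); ~0 when the vectors point the same way.
    # For non-negative feature vectors this lies in [0, 1].
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# A feature vector compared against itself is well below the 1e-2 bound.
features = [0.3, 1.2, 0.0, 4.5]
assert cosine_distance(features, features) < 1e-2
```

Note that the metric is scale-invariant: a vector and any positive multiple of it have distance ~0, which is exactly the "similar direction" property argued for above.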
…p_models.py script no longer modifies the Models.scala source directly. Instead it generates a file in the current working directory and lets the user copy it.
A few minor fixes: the scaleHint converter in DeepImageFeaturizer is now lazy (and params are eagerly transferred to the JVM). Added licenses for generated named model wrappers.
…_model.py to python/model_gen/
8fda2e8 to 10de54a
just small comments. lgtm otherwise. thanks!
"name": name, | ||
"height": model.inputShape()[0], | ||
"width": model.inputShape()[1], | ||
"version": version, | ||
"base64": base64_hash}) | ||
"base64": base64_hash},2)) |
space after ,
i think this might need to be 1 instead of 2
I think 2 is correct, it's the number of spaces for the indent, not a number of indent levels.
Mm not sure what's going on but the generated file's indent doesn't look right:
/**
* What's in Models.scala
*/
/**
* Expected
*/
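To settle the 1-vs-2 question above: in the generator's indent helper (quoted further down in this diff), the second argument counts spaces to prepend, not indent levels, so passing 2 produces a two-space indent. A runnable sketch of that helper:

```python
def indent(s, lvl):
    # lvl is the number of spaces prepended to every line, not a nesting level
    return '\n'.join([' ' * lvl + x for x in s.split('\n')])

# Passing 2 shifts each line of the generated Scala doc comment by two spaces.
print(indent("/**\n * Expected\n */", 2))
```

If the generated file's indentation still looks wrong, the bug would be in how the pieces are concatenated, not in the value 2 itself.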
g = gen_model(name=name, model=modelConstructor(), model_file=f)
if not name in licenses:
    raise KeyError("Missing license for model '%s'" % name )
g = gen_model(license = licenses[name],name=name, model=modelConstructor(), model_file=f)
space after ,
""" | ||
setParams(self, inputCol=None, outputCol=None, modelName=None, decodePredictions=False, | ||
topK=5) | ||
scaleHint="SCALE_AREA_AVERAGING",topK=5) |
space after ,
""" | ||
|
||
|
||
def gen_model(name, model, model_file, version=1, featurize=True): | ||
def indent(s, lvl): | ||
return '\n'.join([' '*lvl + x for x in s.split('\n')]) |
space around *
oh and need to rerun travis
""" | ||
setParams(self, inputCol=None, outputCol=None, modelName=None) | ||
setParams(self, inputCol=None, outputCol=None, modelName=None, decodePredictions=False, |
remove decodePredictions & topK (not sure if we talked about this - the comment here goes to docs).
oh yeah, that's a good catch, thanks
@@ -38,6 +39,9 @@ class DeepImageFeaturizer(override val uid: String) extends Transformer with Def


final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
final val scaleHint: Param[String] = new Param(this,"scaleHint", "hint which method to use for resizing.",
space after this,
"name": name, | ||
"height": model.inputShape()[0], | ||
"width": model.inputShape()[1], | ||
"version": version, | ||
"base64": base64_hash}) | ||
"base64": base64_hash},2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mm not sure what's going on but the generated file's indent doesn't look right:
/**
* What's in Models.scala
*/
/**
* Expected
*/
@@ -16,6 +16,8 @@
import numpy as np
import os

from scipy import spatial
what's going on with the import ordering & grouping here 😂 (lines 16-25)
…pace around operators and arguments."
ee5a0b2 to 64a1d5c
just one comment fix.
kwargs = self._input_kwargs
self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol=None, outputCol=None, modelName=None):
def setParams(self, inputCol=None, outputCol=None, modelName=None, scaleHint="SCALE_AREA_AVERAGING"):
    """
    setParams(self, inputCol=None, outputCol=None, modelName=None)
add scaleHint="SCALE_AREA_AVERAGING"