
Fix load of models that depend on non thread-safe dependencies #27

Merged
merged 1 commit into from Dec 14, 2018

Conversation

paulojrp
Contributor

@paulojrp paulojrp commented Dec 7, 2018

A problem was detected when loading TensorFlow models in different threads inside the same JVM. The failure occurred after loading a TensorFlow model and then trying to import a new TensorFlow model. It was caused by a dependency of TensorFlow (protobuf) being reloaded when it already existed in the JVM (loaded through the 1st thread).

The workaround is to share the problematic module (TensorFlow) across the Python sub-interpreters. This works around the known issues with CPython extensions.
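The description above can be sketched with Jep's shared-modules mechanism. This is a hedged illustration of the Jep API only, not the project's actual code: registering modules backed by CPython extensions as "shared modules" makes all sub-interpreters reuse one copy instead of re-initializing the native extension (and its protobuf dependency) per thread.

```java
// Sketch, assuming the Jep 3.x API; not the provider's real implementation.
import jep.Jep;
import jep.JepConfig;

public class SharedModulesSketch {
    public static void main(String[] args) throws Exception {
        // Mark the problematic CPython-extension modules as shared across
        // all Python sub-interpreters created in this JVM.
        JepConfig config = new JepConfig()
                .addSharedModules("numpy", "tensorflow");
        try (Jep jep = new Jep(config)) {
            // A later interpreter importing tensorflow now reuses the
            // already-initialized native module instead of reloading it.
            jep.eval("import tensorflow");
        }
    }
}
```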

@paulojrp
Contributor Author

paulojrp commented Dec 7, 2018

At the moment I have only tested this with UTs; before landing this I want to make sure that it works in a real environment.

@TravisBuddy

Travis tests have failed

Hey @paulojrp,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

TravisBuddy Request Identifier: b182d090-fa53-11e8-8526-c1031fffd6c8

@pedrorijo91
Copy link
Contributor

I suppose this is related to #26?

/**
* Private constructor for singleton pattern.
*
* @implNote {@link Jep} is initialized inside the current JVM. Due to the need to manage a consistent Python thread
Contributor

Kudos for the explanation! IMHO I would prefer to keep the explanation generic instead of basing it on the specific case we caught (and then with a reference to the specific GitHub issue), but it's a super minor nitpick; please let anyone else comment on this before changing anything.


@pedrorijo91 pedrorijo91 left a comment


A quick review seems good; I will make a deeper review on Monday. But there are tests failing.

@nmldiegues
Contributor

A clear disadvantage of this approach is that it never "releases" models already loaded. E.g., consider a JVM where you load and close 10000 models sequentially; this will eventually go OOM.
I'm thinking: could you maybe cache the model loading in the ClassificationPythonModel, and in the "close" just "null out" the variable (so the Python GC can kick in)? That would probably make this solution perfect.

Quick note: the Docker image used in Travis is not set up with TensorFlow, but I see you added the regression test for it as a UT. Nevertheless there are some failing UTs. I'll let you take a look at them.
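The cache-then-release pattern suggested above can be sketched as follows. The class and method names here are hypothetical illustrations, not the project's actual ClassificationPythonModel code:

```java
// Minimal sketch of the suggestion: cache the expensive model load and
// "null out" the reference in close() so the garbage collector (and, in
// the real provider, the Python GC) can reclaim the model.
public class CachedModel implements AutoCloseable {

    private Object loadedModel; // stand-in for the Python model handle

    /** Loads the model on first access and reuses it afterwards. */
    public Object getModel() {
        if (loadedModel == null) {
            loadedModel = expensiveLoad();
        }
        return loadedModel;
    }

    private Object expensiveLoad() {
        // Placeholder for the real import/load inside the interpreter.
        return new Object();
    }

    /** Releases the cached handle so it can be reclaimed. */
    @Override
    public void close() {
        loadedModel = null;
    }

    public boolean isReleased() {
        return loadedModel == null;
    }
}
```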

@paulojrp
Contributor Author

paulojrp commented Dec 8, 2018

I suppose this is related to #26?

Yes, this commit fixes that issue.

@paulojrp
Contributor Author

paulojrp commented Dec 9, 2018

A clear disadvantage of this approach is that it never "releases" models already loaded. E.g., consider a JVM where you load and close 10000 models sequentially; this will eventually go OOM.
I'm thinking: could you maybe cache the model loading in the ClassificationPythonModel, and in the "close" just "null out" the variable (so the Python GC can kick in)? That would probably make this solution perfect.

Quick note: the Docker image used in Travis is not set up with TensorFlow, but I see you added the regression test for it as a UT. Nevertheless there are some failing UTs. I'll let you take a look at them.

Another problem is the names of the methods responsible for the classification of events. The provider implementations assume that those names are constants ("classify" and "getClassification"). This logic won't work if we only use one Python interpreter. I will explore other solutions.

@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from 01865e4 to e3c6a86 Compare December 10, 2018 23:35
@codecov

codecov bot commented Dec 10, 2018

Codecov Report

Merging #27 into master will decrease coverage by 50.96%.
The diff coverage is 100%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master      #27       +/-   ##
=============================================
- Coverage       100%   49.03%   -50.97%     
- Complexity       22       37       +15     
=============================================
  Files             6       11        +5     
  Lines            42      208      +166     
  Branches          0        8        +8     
=============================================
+ Hits             42      102       +60     
- Misses            0      104      +104     
- Partials          0        2        +2
Impacted Files Coverage Δ Complexity Δ
...zai/openml/python/modules/SharedModulesParser.java 100% <100%> (ø) 7 <7> (?)
...eedzai/openml/python/jep/instance/JepInstance.java 86.48% <100%> (ø) 6 <1> (?)
...nml/python/jep/instance/AbstractJepEvaluation.java 66.66% <0%> (ø) 2% <0%> (?)
...edzai/openml/python/ClassificationPythonModel.java 0% <0%> (ø) 0% <0%> (?)
...n/AbstractClassificationPythonModelLoaderImpl.java 0% <0%> (ø) 0% <0%> (?)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7f2d92f...1337706.

@feedzai feedzai deleted 9 comments Dec 10, 2018
@nmldiegues
Contributor

This is pretty nice @paulojrp. Of course it has the downside that it requires code changes for every new library with the same problem, but it is an acceptable trade-off. I trust that you ran the UT regression of #26 and it passed?

I just have one suggestion: please make the list of packages added to the shared modules be read dynamically (an environment variable? a resource file inside the openml jar? please consider remote Spark executions when devising the solution). This will let users overcome this problem autonomously for a new library in the future. Please also add a note to the README about #26 and this solution.
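The dynamic-configuration suggestion above could be sketched as follows. The environment-variable name and the comma-separated format are hypothetical choices for illustration, not the project's actual implementation:

```java
// Sketch: merge the default shared modules ("numpy" and "tensorflow",
// as named in this PR) with an optional user-supplied override, so users
// can add new problematic libraries without code changes.
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class SharedModulesConfig {

    private static final Set<String> DEFAULTS =
            new LinkedHashSet<>(Arrays.asList("numpy", "tensorflow"));

    /**
     * Merges the defaults with a comma-separated override, e.g. the value
     * of a hypothetical OPENML_SHARED_MODULES environment variable.
     */
    static Set<String> sharedModules(String commaSeparated) {
        Set<String> modules = new LinkedHashSet<>(DEFAULTS);
        if (commaSeparated != null) {
            for (String name : commaSeparated.split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) {
                    modules.add(trimmed);
                }
            }
        }
        return modules;
    }

    public static void main(String[] args) {
        // Reading from the real environment would look like this:
        System.out.println(
                sharedModules(System.getenv("OPENML_SHARED_MODULES")));
    }
}
```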

@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from e3c6a86 to 0214701 Compare December 11, 2018 19:17
@pedrorijo91 pedrorijo91 added this to the next version milestone Dec 12, 2018
@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from 0214701 to a34f327 Compare December 13, 2018 13:39
@feedzai feedzai deleted 2 comments Dec 13, 2018
@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from a34f327 to b70673c Compare December 13, 2018 17:17
@feedzai feedzai deleted 2 comments Dec 13, 2018
@feedzai feedzai deleted 4 comments from TravisBuddy Dec 14, 2018

@pedrorijo91 pedrorijo91 left a comment


overall looks great! :) just a few minor style issues. Feel free to disagree with them

@@ -40,3 +40,15 @@ export PATH=$ANACONDA_PATH/envs/myenv/bin:$PATH
export LD_LIBRARY_PATH=$ANACONDA_PATH/envs/myenv/lib/python3.6/site-packages/jep:$LD_LIBRARY_PATH
export LD_PRELOAD=$ANACONDA_PATH/envs/myenv/lib/libpython3.6m.so
```

7. If you need to share Python modules across sub-interpreters, you would need to create a "python-packages.xml" file where you define the modules to be shared. By default the provider is already sharing the "numpy" and "tensorflow" modules. This is a workaround for the issues with CPython extensions.
Contributor


I think it should be 'you will need to create a (...)'. @krisztinaknagy to validate this one

@@ -40,3 +40,15 @@ export PATH=$ANACONDA_PATH/envs/myenv/bin:$PATH
export LD_LIBRARY_PATH=$ANACONDA_PATH/envs/myenv/lib/python3.6/site-packages/jep:$LD_LIBRARY_PATH
export LD_PRELOAD=$ANACONDA_PATH/envs/myenv/lib/libpython3.6m.so
```

7. If you need to share Python modules across sub-interpreters, you would need to create a "python-packages.xml" file where you define the modules to be shared. By default the provider is already sharing the "numpy" and "tensorflow" modules. This is a workaround for the issues with CPython extensions.
Contributor


is there any official documentation for this python-packages.xml?

Contributor Author


Below you can find an example of the contents of this file.
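The inline example referenced here did not survive the export of this page. As a purely hypothetical sketch (the element names are illustrative, not the provider's actual python-packages.xml schema), such a file might look like:

```xml
<!-- Hypothetical sketch of a python-packages.xml listing modules to be
     shared across Python sub-interpreters; the real schema may differ. -->
<packages>
    <package>numpy</package>
    <package>tensorflow</package>
</packages>
```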


@pedrorijo91 pedrorijo91 left a comment


Actually, I would expect a UT like the one described in #26. Is it too hard to automate?

@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from b70673c to 3c03a58 Compare December 14, 2018 11:50
@TravisBuddy

Hey @paulojrp,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: d0ff2700-ff96-11e8-88a0-5d55312123ee


@nmldiegues nmldiegues left a comment


@pedrorijo91 for adding that UT we'd need TensorFlow to be added to the Docker image used in Travis (and then to the base openml image used inside Feedzai). I'm not sure we want to keep adding everything to the images; they'll grow endlessly.

@paulojrp Reviewed, mostly minor comments: please check the coverage and improve it, and tackle my comments. Thanks!

@pedrorijo91
Contributor

Can't we simply create a custom model that loads Python stuff, @nmldiegues? Anyway, if it's too time/effort-consuming, let's be pragmatic and go without it :)

@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from 3c03a58 to 1337706 Compare December 14, 2018 14:47
@TravisBuddy

Hey @paulojrp,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: ade3d770-ffaf-11e8-88a0-5d55312123ee

@paulojrp
Contributor Author

paulojrp commented Dec 14, 2018

@nmldiegues I created a new UT but the coverage is still decreasing a lot; I don't see how this is related to my modifications.

@pedrorijo91 to reproduce this problem we only had to load a Python file that imports tensorflow. The problem with a unit test for this case is that we would have to modify the Docker image to also install TensorFlow. We could do it, but then what would happen when new packages with these problems are found (modify the image again for each one)? Don't forget that we don't have an openml provider for TensorFlow.
Also, this modification was tested manually by loading a TensorFlow model and using it to score events.


@pedrorijo91 pedrorijo91 left a comment


all good on my side then :)

@paulojrp paulojrp merged commit c064e59 into master Dec 14, 2018
@pedrorijo91 pedrorijo91 deleted the fix-model-with-non-thread-safe-dependencies branch December 14, 2018 16:33