
Fix load of models that depend on non thread-safe dependencies #27

Merged
merged 1 commit into from Dec 14, 2018

Conversation

paulojrp
Contributor

@paulojrp paulojrp commented Dec 7, 2018

A problem was detected when loading TensorFlow models in different threads inside the same JVM. The failure occurred after loading a TensorFlow model and then trying to import a new TensorFlow model. It was caused by a dependency of TensorFlow (protobuf) being reloaded when it already existed in the JVM (loaded through the 1st thread).

The workaround is to share the problematic module (TensorFlow) across the Python sub-interpreters. This works around the known issues with CPython extensions.
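The description above can be sketched with Jep's shared-modules mechanism. This is a hedged illustration of the Jep API only, not the project's actual code: registering modules backed by CPython extensions as "shared modules" makes all sub-interpreters reuse one copy instead of re-initializing the native extension (and its protobuf dependency) per thread.

```java
// Sketch, assuming the Jep 3.x API; not the provider's real implementation.
import jep.Jep;
import jep.JepConfig;

public class SharedModulesSketch {
    public static void main(String[] args) throws Exception {
        // Mark the problematic CPython-extension modules as shared across
        // all Python sub-interpreters created in this JVM.
        JepConfig config = new JepConfig()
                .addSharedModules("numpy", "tensorflow");
        try (Jep jep = new Jep(config)) {
            // A later interpreter importing tensorflow now reuses the
            // already-initialized native module instead of reloading it.
            jep.eval("import tensorflow");
        }
    }
}
```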

@paulojrp
Contributor Author

paulojrp commented Dec 7, 2018

At the moment I have only tested this with UTs; before landing this I want to make sure that it works in a real environment.

@TravisBuddy

Travis tests have failed

Hey @paulojrp,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

TravisBuddy Request Identifier: b182d090-fa53-11e8-8526-c1031fffd6c8

@pedrorijo91
Copy link
Contributor

I suppose this is related to #26?

/**
* Private constructor for singleton pattern.
*
* @implNote {@link Jep} is initialized inside the current JVM. Due to the need to manage a consistent Python thread
Contributor

Kudos for the explanation! IMHO I would prefer to keep the explanation generic instead of basing it on the specific case we caught (and then with a reference to the specific GitHub issue), but it's a super minor nitpick; please let anyone else comment on this before changing anything.


@pedrorijo91 pedrorijo91 left a comment


A quick review seems good; I will make a deeper review on Monday. But there are tests failing.

@nmldiegues
Contributor

A clear disadvantage of this approach is that it never "releases" models already loaded. E.g., consider a JVM where you load and close 10000 models sequentially; this will eventually go OOM.
I'm thinking: could you maybe cache the model loading in the ClassificationPythonModel, and in the "close" just "null out" the variable (so the Python GC can kick in)? That would probably make this solution perfect.

Quick note: the Docker image used in Travis is not set up with TensorFlow, but I see you added the regression test for it as a UT. Nevertheless there are some failing UTs. I'll let you take a look at them.
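The cache-then-release pattern suggested above can be sketched as follows. The class and method names here are hypothetical illustrations, not the project's actual ClassificationPythonModel code:

```java
// Minimal sketch of the suggestion: cache the expensive model load and
// "null out" the reference in close() so the garbage collector (and, in
// the real provider, the Python GC) can reclaim the model.
public class CachedModel implements AutoCloseable {

    private Object loadedModel; // stand-in for the Python model handle

    /** Loads the model on first access and reuses it afterwards. */
    public Object getModel() {
        if (loadedModel == null) {
            loadedModel = expensiveLoad();
        }
        return loadedModel;
    }

    private Object expensiveLoad() {
        // Placeholder for the real import/load inside the interpreter.
        return new Object();
    }

    /** Releases the cached handle so it can be reclaimed. */
    @Override
    public void close() {
        loadedModel = null;
    }

    public boolean isReleased() {
        return loadedModel == null;
    }
}
```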

@paulojrp
Contributor Author

paulojrp commented Dec 8, 2018

I suppose this is related to #26?

Yes, this commit fixes that issue.

@paulojrp
Contributor Author

paulojrp commented Dec 9, 2018

A clear disadvantage of this approach is that it never "releases" models already loaded. E.g., consider a JVM where you load and close 10000 models sequentially; this will eventually go OOM.
I'm thinking: could you maybe cache the model loading in the ClassificationPythonModel, and in the "close" just "null out" the variable (so the Python GC can kick in)? That would probably make this solution perfect.

Quick note: the Docker image used in Travis is not set up with TensorFlow, but I see you added the regression test for it as a UT. Nevertheless there are some failing UTs. I'll let you take a look at them.

Another problem is the names of the methods responsible for the classification of events. The provider implementations assume that those names are constants ("classify" and "getClassification"). This logic won't work if we only use one Python interpreter. I will explore other solutions.

@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from 01865e4 to e3c6a86 Compare December 10, 2018 23:35
@codecov

codecov bot commented Dec 10, 2018

Codecov Report

Merging #27 into master will decrease coverage by 50.96%.
The diff coverage is 100%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master      #27       +/-   ##
=============================================
- Coverage       100%   49.03%   -50.97%     
- Complexity       22       37       +15     
=============================================
  Files             6       11        +5     
  Lines            42      208      +166     
  Branches          0        8        +8     
=============================================
+ Hits             42      102       +60     
- Misses            0      104      +104     
- Partials          0        2        +2
Impacted Files Coverage Δ Complexity Δ
...zai/openml/python/modules/SharedModulesParser.java 100% <100%> (ø) 7 <7> (?)
...eedzai/openml/python/jep/instance/JepInstance.java 86.48% <100%> (ø) 6 <1> (?)
...nml/python/jep/instance/AbstractJepEvaluation.java 66.66% <0%> (ø) 2% <0%> (?)
...edzai/openml/python/ClassificationPythonModel.java 0% <0%> (ø) 0% <0%> (?)
...n/AbstractClassificationPythonModelLoaderImpl.java 0% <0%> (ø) 0% <0%> (?)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7f2d92f...1337706.

@feedzai feedzai deleted 9 comments Dec 10, 2018
@nmldiegues
Contributor

This is pretty nice @paulojrp. Of course it has the downside that it requires code changes for every new library with the same problem, but it is an acceptable trade-off. I trust that you ran the UT regression of #26 and it passed?

I just have one suggestion: please make the list of packages added to the shared modules be read dynamically (an environment variable? a resource file inside the openml jar? please consider remote Spark executions when devising the solution). This will let users overcome this problem autonomously for a new library in the future. Please also add a note to the README about #26 and this solution.
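The dynamic-configuration suggestion above could be sketched as follows. The environment-variable name and the comma-separated format are hypothetical choices for illustration, not the project's actual implementation:

```java
// Sketch: merge the default shared modules ("numpy" and "tensorflow",
// as named in this PR) with an optional user-supplied override, so users
// can add new problematic libraries without code changes.
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class SharedModulesConfig {

    private static final Set<String> DEFAULTS =
            new LinkedHashSet<>(Arrays.asList("numpy", "tensorflow"));

    /**
     * Merges the defaults with a comma-separated override, e.g. the value
     * of a hypothetical OPENML_SHARED_MODULES environment variable.
     */
    static Set<String> sharedModules(String commaSeparated) {
        Set<String> modules = new LinkedHashSet<>(DEFAULTS);
        if (commaSeparated != null) {
            for (String name : commaSeparated.split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) {
                    modules.add(trimmed);
                }
            }
        }
        return modules;
    }

    public static void main(String[] args) {
        // Reading from the real environment would look like this:
        System.out.println(
                sharedModules(System.getenv("OPENML_SHARED_MODULES")));
    }
}
```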

@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from e3c6a86 to 0214701 Compare December 11, 2018 19:17
@pedrorijo91 pedrorijo91 added this to the next version milestone Dec 12, 2018
@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from 0214701 to a34f327 Compare December 13, 2018 13:39
@feedzai feedzai deleted 2 comments Dec 13, 2018
@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from a34f327 to b70673c Compare December 13, 2018 17:17
@feedzai feedzai deleted 2 comments Dec 13, 2018
@feedzai feedzai deleted 4 comments from TravisBuddy Dec 14, 2018

@pedrorijo91 pedrorijo91 left a comment


overall looks great! :) just a few minor style issues. Feel free to disagree with them

@@ -40,3 +40,15 @@ export PATH=$ANACONDA_PATH/envs/myenv/bin:$PATH
export LD_LIBRARY_PATH=$ANACONDA_PATH/envs/myenv/lib/python3.6/site-packages/jep:$LD_LIBRARY_PATH
export LD_PRELOAD=$ANACONDA_PATH/envs/myenv/lib/libpython3.6m.so
```

7. If you need to share Python modules across sub-interpreters, you would need to create a "python-packages.xml" file where you define the modules to be shared. By default the provider is already sharing the "numpy" and "tensorflow" modules. This is a workaround for the issues with CPython extensions.
Contributor


I think it should be 'you will need to create a (...)'. @krisztinaknagy to validate this one

@@ -40,3 +40,15 @@ export PATH=$ANACONDA_PATH/envs/myenv/bin:$PATH
export LD_LIBRARY_PATH=$ANACONDA_PATH/envs/myenv/lib/python3.6/site-packages/jep:$LD_LIBRARY_PATH
export LD_PRELOAD=$ANACONDA_PATH/envs/myenv/lib/libpython3.6m.so
```

7. If you need to share Python modules across sub-interpreters, you would need to create a "python-packages.xml" file where you define the modules to be shared. By default the provider is already sharing the "numpy" and "tensorflow" modules. This is a workaround for the issues with CPython extensions.
Contributor


is there any official documentation for this python-packages.xml?

Contributor Author


Below you can find an example of the contents of this file.
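The inline example referenced here did not survive the export of this page. As a purely hypothetical sketch (the element names are illustrative, not the provider's actual python-packages.xml schema), such a file might look like:

```xml
<!-- Hypothetical sketch of a python-packages.xml listing modules to be
     shared across Python sub-interpreters; the real schema may differ. -->
<packages>
    <package>numpy</package>
    <package>tensorflow</package>
</packages>
```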


@pedrorijo91 pedrorijo91 left a comment


Actually, I would expect a UT like the one described in #26. Is it too hard to automate?

@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from b70673c to 3c03a58 Compare December 14, 2018 11:50
@TravisBuddy

Hey @paulojrp,
Something went wrong with the build.

TravisCI finished with status errored, which means the build failed because of something unrelated to the tests, such as a problem with a dependency or the build process itself.

View build log

TravisBuddy Request Identifier: d0ff2700-ff96-11e8-88a0-5d55312123ee


@nmldiegues nmldiegues left a comment


@pedrorijo91 for adding that UT we'd need TensorFlow to be added to the Docker image used in Travis (and then to the base openml image used inside Feedzai). I'm not sure we want to keep adding everything to the images; they'll grow endlessly.

@paulojrp Reviewed, mostly minor comments: please check the coverage and improve it, and tackle my comments. Thanks!

@pedrorijo91
Contributor

Can't we simply create a custom model that loads Python stuff, @nmldiegues? Anyway, if it's too time/effort-consuming, let's be pragmatic and go without it :)

@paulojrp paulojrp force-pushed the fix-model-with-non-thread-safe-dependencies branch from 3c03a58 to 1337706 Compare December 14, 2018 14:47
@TravisBuddy

Hey @paulojrp,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: ade3d770-ffaf-11e8-88a0-5d55312123ee

@paulojrp
Contributor Author

paulojrp commented Dec 14, 2018

@nmldiegues I created a new UT but the coverage is still decreasing a lot; I don't see how this is related to my modifications.

@pedrorijo91 to reproduce this problem we only had to load a Python file that imports tensorflow. The problem with a unit test for this case is that we would have to modify the Docker image to also install TensorFlow. We could do it, but then what would happen when new packages with these problems are found (modify the image again for each one)? Don't forget that we don't have an openml provider for TensorFlow.
Also, this modification was tested manually by loading a TensorFlow model and using it to score events.


@pedrorijo91 pedrorijo91 left a comment


all good on my side then :)

@paulojrp paulojrp merged commit c064e59 into master Dec 14, 2018
@pedrorijo91 pedrorijo91 deleted the fix-model-with-non-thread-safe-dependencies branch December 14, 2018 16:33