
Integration with MLflow #353

Merged
merged 28 commits into from May 27, 2019

Conversation

thunterdb (Contributor) commented May 19, 2019

This PR adds basic integration with MLflow, so that models that have the pyfunc flavor (which is most of them) can be loaded as predictors. These predictors then work on both pandas and Koalas dataframes with no code change. See the documentation example for details.

The goal of this PR is to introduce a simple interface for ML models, while keeping this dependency optional so that the base installation has minimal dependencies.

Note: Travis uses pip because no version of mlflow is published on conda-forge for Python 3.5.
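The "load a pyfunc-flavored model as a predictor" idea in the description can be sketched in plain Python. `PythonModelWrapper` and `FakeModel` below are hypothetical stand-ins for illustration, not the PR's actual classes:

```python
class FakeModel:
    """Stand-in for an MLflow pyfunc model: anything with a .predict()."""
    def predict(self, rows):
        return [sum(r) for r in rows]


class PythonModelWrapper:
    """Wraps a loaded model behind a single .predict() entry point.

    In the real integration this dispatches on the input type (pandas
    DataFrame vs Koalas DataFrame, via a pandas UDF); here we simply
    call through to the underlying model.
    """
    def __init__(self, model):
        self._model = model

    def predict(self, data):
        return self._model.predict(data)


wrapper = PythonModelWrapper(FakeModel())
print(wrapper.predict([[1, 2], [3, 4]]))  # [3, 7]
```

The point of the wrapper is that calling code never changes when the input frame type does; only the dispatch inside `predict` would.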

codecov-io commented May 19, 2019

Codecov Report

Merging #353 into master will increase coverage by <.01%.
The diff coverage is 95.23%.


@@            Coverage Diff             @@
##           master     #353      +/-   ##
==========================================
+ Coverage   94.74%   94.75%   +<.01%     
==========================================
  Files          41       42       +1     
  Lines        4513     4554      +41     
==========================================
+ Hits         4276     4315      +39     
- Misses        237      239       +2
Impacted Files Coverage Δ
databricks/koalas/frame.py 95.54% <100%> (ø) ⬆️
databricks/koalas/mlflow.py 95.12% <95.12%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 10db730...8fc73a8. Read the comment docs.

@thunterdb thunterdb changed the title [WIP] Integration with MLflow Integration with MLflow May 19, 2019
rxin (Contributor) commented May 22, 2019

This is super cool.

Review thread on setup.py (outdated, resolved)
@@ -59,6 +59,7 @@ install:
- conda config --env --add pinned_packages pyarrow==$PYARROW_VERSION
- conda install -c conda-forge --yes codecov
- conda install -c conda-forge --yes --file requirements-dev.txt
- pip install mlflow
Member:

Out of curiosity, is mlflow not in conda? If it's uploaded, we can just write it into requirements-dev.txt.

Collaborator:

Actually mlflow is in conda, but it seems to break dependency resolution with Python 3.5 for some reason.
https://travis-ci.com/databricks/koalas/jobs/201362227

Member:

Hm, then can we use conda and skip it for the Python 3.5 build? We can do it in a follow-up too.

Contributor Author:

I tried hard, but currently mlflow is not packaged for Python 3.5 and 3.7. I suggest that we keep it this way until the 1.0 release?

Collaborator:

FYI: mlflow for Python 3.7 is available on conda-forge.
If we keep it this way, we need to add a note in CONTRIBUTING.md that mlflow must be installed separately for testing, especially for Python 3.5 users.

Interesting, I was not aware we put mlflow on conda. Maybe conda pulls it from pip? I would definitely recommend installing it from pip for now.

Collaborator:

FYI: Here's the mlflow recipe. https://github.com/conda-forge/mlflow-feedstock

Member:

(I don't think conda pulls it from pip out of the box)

HyukjinKwon (Member):

Looks fine, but I suspect I don't know enough about mlflow. I don't mind merging it. I will leave it to you guys.

ueshin (Collaborator) left a comment:

I'm not familiar with mlflow either, so I'd leave it to @thunterdb.

One thing I noticed is that client = MlflowClient() in the doctests creates a directory mlruns in the working directory.
Can we move it to a tmp dir or somewhere?
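One way to address this, sketched below under the assumption that the doctests can set the tracking store before creating the client: point MLflow's file-based tracking store at a system temp directory. `mlflow.set_tracking_uri` is MLflow's real API; the import is guarded because mlflow is an optional dependency in this PR.

```python
import os
import tempfile

# Create a scratch directory; MLflow file stores are addressed by "file:" URIs.
tracking_dir = tempfile.mkdtemp(prefix="mlruns-")
tracking_uri = "file:" + tracking_dir

try:
    import mlflow
    mlflow.set_tracking_uri(tracking_uri)  # runs now land under tracking_dir
except ImportError:
    pass  # mlflow is optional; the sketch degrades gracefully without it

print(os.path.isdir(tracking_dir))  # True
```

With this in place, no `mlruns` directory is created in the working directory.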

Review threads on databricks/koalas/mlflow.py (resolved)
thunterdb (Contributor Author):

@ueshin good point about the temp directory; it now puts the runs in the system temp directory, as it should.

Review threads on databricks/koalas/mlflow.py (outdated, resolved)
... mlflow.sklearn.log_model(lr, "model")
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now that our model is logged using MLflow, we load it back and apply it to a Koalas DataFrame:
Contributor:
DataFrame

>>> # Will fail with a message about dataframes not aligned.
>>> df["y"] = y # doctest: +SKIP

This is being tracked in the issue ticket https://github.com/databricks/koalas/issues/354.
Contributor:

Are we actually going to address this issue? If not, perhaps we should create a function that simply adds a column to an existing dataframe. Then it is more obvious what can be done.

Contributor Author:

This is being resolved as of #381. Users will have to use .merge() for the time being, but this should be acceptable for most use cases. I will update the documentation once that PR is merged.
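The .merge() workaround mentioned above amounts to joining predictions back on a key instead of assigning a column across misaligned frames. A minimal sketch in plain Python (dicts stand in for Koalas frames; all names are illustrative):

```python
frame = [{"id": 1, "x": 10}, {"id": 2, "x": 20}]
preds = [{"id": 1, "y": 0.1}, {"id": 2, "y": 0.9}]

# Build a lookup on the join key, then attach the prediction per row --
# the dict-based analogue of frame.merge(preds, on="id").
by_id = {p["id"]: p["y"] for p in preds}
merged = [{**row, "y": by_id[row["id"]]} for row in frame]
print(merged[0])  # {'id': 1, 'x': 10, 'y': 0.1}
```

The key-based join sidesteps the alignment problem because it never relies on the two frames sharing an index.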

Contributor Author:

Documentation updated.

return as_spark_type(hint)

@lazy_property
def _model(self) -> PythonModel:

The return value of load_pyfunc is PythonModel only for custom pyfunc models. Other flavors may return any object they like; the only requirement is that it has a predict method.
I am not sure how the Python type annotations work. Are they checked at runtime?

Also, load_pyfunc is being renamed to load_model.

Contributor Author:

Thanks for the info about the rename and the return type. Python's typing module allows defining structural subtyping, but the syntax is not nice and it is not used much by the type-checking tools anyway.
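For reference, the structural subtyping mentioned here later got first-class syntax via typing.Protocol (PEP 544, standard in Python 3.8, i.e. after this thread). A sketch of the "anything with a predict method" type; `SupportsPredict` and `SklearnLike` are hypothetical names:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsPredict(Protocol):
    """Structural type: any object with a predict() method matches."""
    def predict(self, data: Any) -> Any: ...


class SklearnLike:  # no inheritance from the protocol is needed
    def predict(self, data):
        return [0 for _ in data]


print(isinstance(SklearnLike(), SupportsPredict))  # True
print(isinstance(object(), SupportsPredict))       # False
```

Note that `runtime_checkable` isinstance checks only verify the method's presence, not its signature; full structural checking is left to static type checkers.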

def predict(self, data):
"""
Returns a prediction on the data.
If the data is a pandas Dataframe, the return is a pandas series (TODO: np.array?)
The return value is pandas.Series (only for the udf, though).

Contributor Author:

Unfortunately, in the case of a pandas dataframe input, the output can be anything (and the doctest shows that sklearn returns an np.array instead).

return self._model.predict(data)
if isinstance(data, DataFrame):
cols = [data._sdf[n] for n in data.columns]
return_col = self._model_udf(*cols)

Btw, when you apply the udf to *cols, the data is passed to the udf as an array of columns and the column names are lost, which breaks many models. Starting with DBR 5.4 conda, you can pass the columns as a struct, in which case the udf gets a dataframe with column names.

E.g. you can call it like this:

spark_df.withColumn("prediction", f(struct("x", "label"))).select("prediction")

Or you will be able to after we update our spark_udf call.
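Why losing the column names breaks models can be shown without Spark. In the plain-Python sketch below, passing columns positionally (the `f(*cols)` style) drops the names, while a struct (simulated with dicts) preserves them; all names are illustrative:

```python
def model_predict(row):
    # A model that looks fields up by column name, like many pyfunc models.
    return row["x"] * 2 + row["label"]


columns = {"x": [1, 2, 3], "label": [10, 20, 30]}

# Struct-style: names survive, so the model works.
rows = [dict(zip(columns, vals)) for vals in zip(*columns.values())]
print([model_predict(r) for r in rows])  # [12, 24, 36]

# Positional-style: rows arrive as bare tuples, names are gone, and a
# name-based model like model_predict cannot be applied as-is.
positional_rows = list(zip(*columns.values()))
print(positional_rows[0])  # (1, 10)
```

This is exactly the difference between `f(*cols)` and `f(struct("x", "label"))` on the Spark side.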

Contributor Author:

This is great to know. I just tried, and it is not supported yet in OSS Spark <= 2.4.3 (currently the latest). I have put a comment, along with the code snippet, to fix this once Spark 2.4.4+ is released.

ueshin (Collaborator) commented May 27, 2019:

Unfortunately, it won't be included in the 2.4 series, since it's a new feature added after the 2.4 code freeze. We usually don't include new features after code freeze.

Makes sense.

Unfortunately, until that feature becomes available, you may be returning different results for a pandas dataframe vs a Spark dataframe. The spark_udf currently returns the left-most numeric column, so if your model's predict run on a pandas df returns anything other than a 1-d numeric vector, it's going to be inconsistent. The return type of the mlflow spark_udf can be controlled by passing an explicit return_type argument. Unfortunately, there is no way to get it back as a struct. I would consider copying the same logic here for consistency, but that's not great either.
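The "left-most numeric column" behavior described here can be sketched in plain Python (a dict of lists stands in for the prediction frame; the function name is illustrative, not MLflow's API):

```python
def leftmost_numeric_column(frame):
    """Mimic the described spark_udf behavior: keep only the first
    column whose values are all numeric, discarding everything else."""
    for name, values in frame.items():
        if all(isinstance(v, (int, float)) for v in values):
            return name, values
    raise ValueError("no numeric column found")


# A model whose pandas-side predict() returns two columns...
prediction_frame = {"class": ["a", "b"], "score": [0.9, 0.1]}

# ...loses the "class" column on the Spark side:
print(leftmost_numeric_column(prediction_frame))  # ('score', [0.9, 0.1])
```

Hence the inconsistency: the pandas path would return both columns, while the Spark path keeps only the numeric one.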

Oh, I see you are already passing the return type argument to spark_udf.

thunterdb (Contributor Author):

@ueshin @tomasatdatabricks comments addressed. Let me know if you would like to merge.

tomasatdatabricks left a comment:

lgtm!

I left a comment about the mlflow model potentially returning different results for a pandas vs Spark DataFrame, but I don't think that needs to be addressed right now.

ueshin (Collaborator) commented May 27, 2019

LGTM except for the Spark version in the comment related to https://github.com/databricks/koalas/pull/353/files#r287638021.

HyukjinKwon (Member):

Looks good to me too. I can take a separate look for conda.


thunterdb (Contributor Author):

@ueshin comment fixed, I am merging. Thanks!

@thunterdb thunterdb merged commit 94c0342 into databricks:master May 27, 2019
ueshin (Collaborator) commented May 28, 2019

oops, the next Spark release will be 3.0. I'll fix it soon.
