
Integration with MLflow #353

Merged
merged 28 commits into from May 27, 2019

Conversation

thunterdb (Contributor) commented May 19, 2019

This PR adds basic integration with MLflow, so that models that have the pyfunc flavor (which is most of them) can be loaded as predictors. These predictors then work on both pandas and Koalas dataframes with no code change. See the documentation example for details.

The goal of this PR is to introduce a simple interface for ML models, while keeping this dependency optional so that the base installation has minimal dependencies.

Note: Travis uses pip because no version of mlflow is published on conda-forge for Python 3.5.
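The "load a pyfunc-flavored model as a predictor" idea in the description can be sketched in plain Python. `PythonModelWrapper` and `FakeModel` below are hypothetical stand-ins for illustration, not the PR's actual classes:

```python
class FakeModel:
    """Stand-in for an MLflow pyfunc model: anything with a .predict()."""
    def predict(self, rows):
        return [sum(r) for r in rows]


class PythonModelWrapper:
    """Wraps a loaded model behind a single .predict() entry point.

    In the real integration this dispatches on the input type (pandas
    DataFrame vs Koalas DataFrame, via a pandas UDF); here we simply
    call through to the underlying model.
    """
    def __init__(self, model):
        self._model = model

    def predict(self, data):
        return self._model.predict(data)


wrapper = PythonModelWrapper(FakeModel())
print(wrapper.predict([[1, 2], [3, 4]]))  # [3, 7]
```

The point of the wrapper is that calling code never changes when the input frame type does; only the dispatch inside `predict` would.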

codecov-io commented May 19, 2019

Codecov Report

Merging #353 into master will increase coverage by <.01%.
The diff coverage is 95.23%.


@@            Coverage Diff             @@
##           master     #353      +/-   ##
==========================================
+ Coverage   94.74%   94.75%   +<.01%     
==========================================
  Files          41       42       +1     
  Lines        4513     4554      +41     
==========================================
+ Hits         4276     4315      +39     
- Misses        237      239       +2
Impacted Files Coverage Δ
databricks/koalas/frame.py 95.54% <100%> (ø) ⬆️
databricks/koalas/mlflow.py 95.12% <95.12%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 10db730...8fc73a8. Read the comment docs.

@thunterdb thunterdb changed the title [WIP] Integration with MLflow Integration with MLflow May 19, 2019
rxin (Contributor) commented May 22, 2019

This is super cool.

Review thread on setup.py (outdated, resolved)
@@ -59,6 +59,7 @@ install:
- conda config --env --add pinned_packages pyarrow==$PYARROW_VERSION
- conda install -c conda-forge --yes codecov
- conda install -c conda-forge --yes --file requirements-dev.txt
- pip install mlflow
Member:

Out of curiosity, is mlflow not in conda? If it's uploaded, we can just write it into requirements-dev.txt.

Collaborator:

Actually mlflow is in conda, but it seems to break dependency resolution with Python 3.5 for some reason.
https://travis-ci.com/databricks/koalas/jobs/201362227

Member:

Hm, then can we use conda and skip it for the Python 3.5 build? We can do it in a follow-up too.

Contributor Author:

I tried hard, but currently mlflow is not packaged for Python 3.5 and 3.7. I suggest that we keep it this way until the 1.0 release?

Collaborator:

FYI: mlflow for Python 3.7 is available on conda-forge.
If we keep it this way, we need to add a note in CONTRIBUTING.md that mlflow must be installed separately for testing, especially for Python 3.5 users.

Interesting, I was not aware we put mlflow on conda. Maybe conda pulls it from pip? I would definitely recommend installing it from pip for now.

Collaborator:

FYI: Here's the mlflow recipe. https://github.com/conda-forge/mlflow-feedstock

Member:

(I don't think conda pulls it from pip out of the box)

HyukjinKwon (Member):

Looks fine, but I suspect I don't know enough about mlflow. I don't mind merging it. I will leave it to you guys.

ueshin (Collaborator) left a comment:

I'm not familiar with mlflow either, so I'd leave it to @thunterdb.

One thing I noticed is that client = MlflowClient() in the doctests creates a directory mlruns in the working directory.
Can we move it to a tmp dir or somewhere?
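One way to address this, sketched below under the assumption that the doctests can set the tracking store before creating the client: point MLflow's file-based tracking store at a system temp directory. `mlflow.set_tracking_uri` is MLflow's real API; the import is guarded because mlflow is an optional dependency in this PR.

```python
import os
import tempfile

# Create a scratch directory; MLflow file stores are addressed by "file:" URIs.
tracking_dir = tempfile.mkdtemp(prefix="mlruns-")
tracking_uri = "file:" + tracking_dir

try:
    import mlflow
    mlflow.set_tracking_uri(tracking_uri)  # runs now land under tracking_dir
except ImportError:
    pass  # mlflow is optional; the sketch degrades gracefully without it

print(os.path.isdir(tracking_dir))  # True
```

With this in place, no `mlruns` directory is created in the working directory.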

Review threads on databricks/koalas/mlflow.py (resolved)
thunterdb (Contributor Author):

@ueshin good point about the temp directory; it now puts the runs in the system temp directory, as it should.

Review threads on databricks/koalas/mlflow.py (outdated, resolved)
... mlflow.sklearn.log_model(lr, "model")
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now that our model is logged using MLflow, we load it back and apply it to a Koalas DataFrame:
Contributor:
DataFrame

>>> # Will fail with a message about dataframes not aligned.
>>> df["y"] = y # doctest: +SKIP

This is being tracked in the issue ticket https://github.com/databricks/koalas/issues/354.
Contributor:

Are we actually going to address this issue? If not, perhaps we should create a function that simply adds a column to an existing dataframe. Then it is more obvious what can be done.

Contributor Author:

This is being resolved as of #381. Users will have to use .merge() for the time being, but this should be acceptable for most use cases. I will update the documentation once that PR is merged.
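The .merge() workaround mentioned above amounts to joining predictions back on a key instead of assigning a column across misaligned frames. A minimal sketch in plain Python (dicts stand in for Koalas frames; all names are illustrative):

```python
frame = [{"id": 1, "x": 10}, {"id": 2, "x": 20}]
preds = [{"id": 1, "y": 0.1}, {"id": 2, "y": 0.9}]

# Build a lookup on the join key, then attach the prediction per row --
# the dict-based analogue of frame.merge(preds, on="id").
by_id = {p["id"]: p["y"] for p in preds}
merged = [{**row, "y": by_id[row["id"]]} for row in frame]
print(merged[0])  # {'id': 1, 'x': 10, 'y': 0.1}
```

The key-based join sidesteps the alignment problem because it never relies on the two frames sharing an index.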

Contributor Author:

Documentation updated.

return as_spark_type(hint)

@lazy_property
def _model(self) -> PythonModel:

The return value of load_pyfunc is PythonModel only for custom pyfunc models. Other flavors may return any object they like; the only requirement is that it has a predict method.
I am not sure how the Python type annotations work. Are they checked at runtime?

Also, load_pyfunc is being renamed to load_model.

Contributor Author:

Thanks for the info about the rename and the return type. Python's typing module allows defining structural subtyping, but the syntax is not nice and it is not used much by the type-checking tools anyway.
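For reference, the structural subtyping mentioned here later got first-class syntax via typing.Protocol (PEP 544, standard in Python 3.8, i.e. after this thread). A sketch of the "anything with a predict method" type; `SupportsPredict` and `SklearnLike` are hypothetical names:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsPredict(Protocol):
    """Structural type: any object with a predict() method matches."""
    def predict(self, data: Any) -> Any: ...


class SklearnLike:  # no inheritance from the protocol is needed
    def predict(self, data):
        return [0 for _ in data]


print(isinstance(SklearnLike(), SupportsPredict))  # True
print(isinstance(object(), SupportsPredict))       # False
```

Note that `runtime_checkable` isinstance checks only verify the method's presence, not its signature; full structural checking is left to static type checkers.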

def predict(self, data):
"""
Returns a prediction on the data.
If the data is a pandas Dataframe, the return is a pandas series (TODO: np.array?)
The return value is pandas.Series (only for the udf, though).

Contributor Author:

Unfortunately, in the case of a pandas dataframe input, the output can be anything (and the doctest shows that sklearn returns an np.array instead).

return self._model.predict(data)
if isinstance(data, DataFrame):
cols = [data._sdf[n] for n in data.columns]
return_col = self._model_udf(*cols)

Btw, when you apply the udf to *cols, the data is passed to the udf as an array of columns and the column names are lost, which breaks many models. Starting with DBR 5.4 conda, you can pass the columns as a struct, in which case the udf gets a dataframe with column names.

E.g. you can call it like this:

spark_df.withColumn("prediction", f(struct("x", "label"))).select("prediction")

Or you will be able to after we update our spark_udf call.
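Why losing the column names breaks models can be shown without Spark. In the plain-Python sketch below, passing columns positionally (the `f(*cols)` style) drops the names, while a struct (simulated with dicts) preserves them; all names are illustrative:

```python
def model_predict(row):
    # A model that looks fields up by column name, like many pyfunc models.
    return row["x"] * 2 + row["label"]


columns = {"x": [1, 2, 3], "label": [10, 20, 30]}

# Struct-style: names survive, so the model works.
rows = [dict(zip(columns, vals)) for vals in zip(*columns.values())]
print([model_predict(r) for r in rows])  # [12, 24, 36]

# Positional-style: rows arrive as bare tuples, names are gone, and a
# name-based model like model_predict cannot be applied as-is.
positional_rows = list(zip(*columns.values()))
print(positional_rows[0])  # (1, 10)
```

This is exactly the difference between `f(*cols)` and `f(struct("x", "label"))` on the Spark side.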

Contributor Author:

This is great to know. I just tried, and it is not supported yet in OSS Spark <= 2.4.3 (currently the latest). I have put a comment, along with the code snippet, to fix this once Spark 2.4.4+ is released.

ueshin (Collaborator) commented May 27, 2019:

Unfortunately, it won't be included in the 2.4 series, since it's a new feature added after the 2.4 code freeze. We usually don't include new features after code freeze.

Makes sense.

Unfortunately, until that feature becomes available, you may be returning different results for a pandas dataframe vs a Spark dataframe. The spark_udf currently returns the left-most numeric column, so if your model's predict run on a pandas df returns anything other than a 1-d numeric vector, it's going to be inconsistent. The return type of the mlflow spark_udf can be controlled by passing an explicit return_type argument. Unfortunately, there is no way to get it back as a struct. I would consider copying the same logic here for consistency, but that's not great either.
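The "left-most numeric column" behavior described here can be sketched in plain Python (a dict of lists stands in for the prediction frame; the function name is illustrative, not MLflow's API):

```python
def leftmost_numeric_column(frame):
    """Mimic the described spark_udf behavior: keep only the first
    column whose values are all numeric, discarding everything else."""
    for name, values in frame.items():
        if all(isinstance(v, (int, float)) for v in values):
            return name, values
    raise ValueError("no numeric column found")


# A model whose pandas-side predict() returns two columns...
prediction_frame = {"class": ["a", "b"], "score": [0.9, 0.1]}

# ...loses the "class" column on the Spark side:
print(leftmost_numeric_column(prediction_frame))  # ('score', [0.9, 0.1])
```

Hence the inconsistency: the pandas path would return both columns, while the Spark path keeps only the numeric one.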

Oh, I see you are already passing the return type argument to spark_udf.

thunterdb (Contributor Author):

@ueshin @tomasatdatabricks comments addressed. Let me know if you would like to merge.

tomasatdatabricks left a comment:

lgtm!

I left a comment about the mlflow model potentially returning different results for a pandas vs Spark DataFrame, but I don't think that needs to be addressed right now.

ueshin (Collaborator) commented May 27, 2019

LGTM except for the Spark version in the comment related to https://github.com/databricks/koalas/pull/353/files#r287638021.

HyukjinKwon (Member):

Looks good to me too. I can take a separate look for conda.


thunterdb (Contributor Author):

@ueshin comment fixed, I am merging. Thanks!

@thunterdb thunterdb merged commit 94c0342 into databricks:master May 27, 2019
ueshin (Collaborator) commented May 28, 2019

oops, the next Spark release will be 3.0. I'll fix it soon.
