Integration with MLflow #353
Conversation
Codecov Report
@@ Coverage Diff @@
## master #353 +/- ##
==========================================
+ Coverage 94.74% 94.75% +<.01%
==========================================
Files 41 42 +1
Lines 4513 4554 +41
==========================================
+ Hits 4276 4315 +39
- Misses 237 239 +2
Continue to review full report at Codecov.
This is super cool.
@@ -59,6 +59,7 @@ install:
  - conda config --env --add pinned_packages pyarrow==$PYARROW_VERSION
  - conda install -c conda-forge --yes codecov
  - conda install -c conda-forge --yes --file requirements-dev.txt
  - pip install mlflow
Out of curiosity, is mlflow not in conda? If it's uploaded, we can just write it into requirements-dev.txt.
Actually mlflow is in conda, but it seems to break dependency resolution with Python 3.5 for some reason.
https://travis-ci.com/databricks/koalas/jobs/201362227
Hm, then can we use conda and skip it for the Python 3.5 build? We can do it in a follow-up too.
I tried hard, but currently mlflow is not packaged for Python 3.5 and 3.7. I suggest we keep it this way until the 1.0 release?
FYI: mlflow for Python 3.7 is available on conda-forge.
If we keep it this way, we need to add a note in CONTRIBUTING.md that mlflow must be installed separately for testing, especially for Python 3.5 users.
Interesting, I was not aware we put mlflow on conda. Maybe conda pulls it from pip? I would definitely recommend loading it from pip for now.
FYI: Here's the mlflow recipe. https://github.com/conda-forge/mlflow-feedstock
(I don't think conda pulls it from pip out of the box.)
Looks fine, but I suspect I don't know enough about mlflow. I don't mind merging it; I will leave it to you guys.
I'm not familiar with mlflow either, so I'd leave it to @thunterdb. One thing I noticed is that client = MlflowClient() in the doctests creates a directory mlruns in the working directory. Can we move it to a tmp dir or somewhere?
@ueshin good point about the temp directory; now it puts the runs in the system temp directory, as it should.
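The fix described above can be sketched with the standard library; the exact call site in the doctest setup is an assumption, but redirecting the tracking URI is the usual way to keep `mlruns/` out of the working directory:

```python
# Minimal sketch: point MLflow's tracking store at a fresh temp directory so
# doctests do not create an `mlruns/` folder in the working directory.
# The mlflow call is commented out so this snippet needs only the stdlib;
# uncomment it in a real test setup.
import tempfile

tmp_dir = tempfile.mkdtemp(prefix="mlflow-doctest-")
tracking_uri = "file:" + tmp_dir

# import mlflow
# mlflow.set_tracking_uri(tracking_uri)
```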
... mlflow.sklearn.log_model(lr, "model")
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Now that our model is logged with MLflow, we load it back and apply it to a Koalas DataFrame:
databricks/koalas/mlflow.py
Outdated
>>> # Will fail with a message about dataframes not aligned.
>>> df["y"] = y  # doctest: +SKIP
This is being tracked in https://github.com/databricks/koalas/issues/354.
Are we actually going to address this issue? If not, perhaps we should create a function that simply adds a column to an existing dataframe. Then it would be more obvious what can be done.
This is being resolved as of #381. Users will then have to .merge() for the time being, but this should be acceptable for most use cases. I will update the documentation once that PR is merged.
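The interim .merge() workaround mentioned above can be sketched with plain pandas (the column names and data here are made up for illustration):

```python
# Sketch of the interim workaround: instead of assigning a prediction column
# into an existing dataframe (`df["y"] = y`, which fails when the frames are
# not aligned), join the predictions back with an index-based merge.
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.1, 0.2, 0.3]})
y = pd.DataFrame({"y": [10.0, 20.0, 30.0]})  # e.g. model predictions

# Merge on the index rather than assigning a column directly.
result = df.merge(y, left_index=True, right_index=True)
```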
Documentation updated.
databricks/koalas/mlflow.py
Outdated
return as_spark_type(hint)

@lazy_property
def _model(self) -> PythonModel:
The return value of load_pyfunc is PythonModel only for custom pyfunc models. Other flavors may return any object they like; the only requirement is that it has a predict method.
I am not sure how the python type annotations work. Is this checked at runtime?
Also, load_pyfunc is being renamed to load_model.
Thanks for the info about the rename and the return type. Python's typing module allows defining structural subtyping, but the syntax is not nice and it is not used much by the type-checking tools anyway.
databricks/koalas/mlflow.py
Outdated
def predict(self, data):
    """
    Returns a prediction on the data.
    If the data is a pandas Dataframe, the return is a pandas series (TODO: np.array?)
The return value is a pandas.Series (only for the udf, though).
Unfortunately, in the case of pandas dataframe input, the output can be anything (the doctest shows that sklearn returns an np.array instead).
return self._model.predict(data)
if isinstance(data, DataFrame):
    cols = [data._sdf[n] for n in data.columns]
    return_col = self._model_udf(*cols)
Btw, when you apply the udf to *cols, the data is passed to the udf as an array of columns and the column names are lost, which breaks many models. Starting with DBR 5.4 conda, you can pass the columns as a struct, in which case the udf gets a dataframe with column names.
E.g. you can call it like this:
spark_df.withColumn("prediction", f(struct("x", "label"))).select("prediction")
Or you will be able to, after we update our spark_udf call.
This is great to know. I just tried and it is not supported yet in OSS Spark <= 2.4.3 (currently the latest). I have put a comment to fix this, along with the code snippet, once spark 2.4.4+ is released.
Unfortunately, it won't be included in 2.4 series since it's a new feature after 2.4 code freeze. We usually don't include new features after code freeze.
Makes sense.
Unfortunately, until that feature becomes available you may be returning different results for a pandas dataframe vs a Spark dataframe. The spark_udf currently returns the left-most numeric column, so if your model's predict run on a pandas df returns anything other than a 1-d numeric vector, it's going to be inconsistent. The return type of mlflow's spark_udf can be controlled by passing an explicit return_type argument. Unfortunately there is no way of getting it back as a struct. I would consider copying the same logic here for consistency, but that's not great either.
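The inconsistency described above can be illustrated in plain pandas; the left-most-numeric-column selection mimics the spark_udf behavior described in this comment, and the data is made up:

```python
# Sketch: mimic the described spark_udf behavior of keeping only the
# left-most numeric column when a model's predict returns several columns.
import pandas as pd

def left_most_numeric(result: pd.DataFrame) -> pd.Series:
    """Select the left-most numeric column of a multi-column prediction."""
    numeric = result.select_dtypes(include="number")
    return numeric.iloc[:, 0]

# A hypothetical model output with one string and two numeric columns: on a
# pandas input the full frame would be returned, but through spark_udf only
# the "p0" column would survive.
preds = pd.DataFrame({"label": ["a", "b"], "p0": [0.1, 0.9], "p1": [0.9, 0.1]})
col = left_most_numeric(preds)
```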
Oh, I see you are already passing the return type argument to spark_udf.
@ueshin @tomasatdatabricks comments addressed. Let me know if you would like to merge.
lgtm!
I left a comment about the mlflow model potentially returning different results for pandas vs Spark DataFrames, but I don't think that needs to be addressed right now.
LGTM except for the Spark version in the comment related to https://github.com/databricks/koalas/pull/353/files#r287638021.
Looks good to me too. I can take a separate look for conda.
@ueshin comment fixed, I am merging. Thanks!
Oops, the next Spark release will be 3.0. I'll fix it soon.
This PR adds basic integration with MLflow, so that models that have the pyfunc flavor (which is most of them) can be loaded as predictors. These predictors then work on both pandas and Koalas dataframes with no code change. See the documentation example for details.

The goal of this PR is to introduce a simple interface for ML models while keeping this dependency optional, so that the base installation has minimal dependencies.

Note: Travis uses pip because no version of mlflow is published on conda-forge for Python 3.5.
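The workflow this PR enables can be sketched as follows. The mlflow calls are commented out so the snippet runs with scikit-learn and pandas only; the run setup and model path are illustrative, not the exact API added by this PR:

```python
# End-to-end sketch: train a model, log it with MLflow, then apply the loaded
# predictor unchanged to a pandas (or, with this PR, Koalas) dataframe.
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [0.0, 2.0, 4.0]})
lr = LinearRegression().fit(train[["x"]], train["y"])

# import mlflow.sklearn
# with mlflow.start_run():
#     mlflow.sklearn.log_model(lr, "model")
# The logged model can then be loaded back through the pyfunc flavor and the
# resulting predictor applied to pandas and Koalas frames with no code change.

pred = lr.predict(pd.DataFrame({"x": [3.0]}))  # the fit is exact: y = 2x
```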