[FLINK-1966][ml] Add support for Predictive Model Markup Language #1186
Conversation
sachingoel0101 commented on Sep 28, 2015:
- Adds an interface to allow exporting of models to PMML format.
- Implements export methods for the existing SVM and Regression algorithms.
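The exporting interface described above could be sketched roughly as follows. This is a hedged illustration only: the names `PMMLExportable` and `LinearModel` are hypothetical, not the actual FlinkML API, and the emitted XML is a minimal PMML 4.2 fragment rather than a complete document.

```scala
// Hypothetical export interface: each model renders itself to a PMML string.
trait PMMLExportable {
  def toPMML: String
}

// Toy linear model y = w . x + b, exported as a minimal RegressionModel.
case class LinearModel(weights: Array[Double], intercept: Double)
    extends PMMLExportable {
  def toPMML: String = {
    val coeffs = weights.zipWithIndex
      .map { case (w, i) => s"""<NumericPredictor name="x$i" coefficient="$w"/>""" }
      .mkString("\n      ")
    s"""<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header>
    <Application name="FlinkML"/>
  </Header>
  <RegressionModel functionName="regression">
    <RegressionTable intercept="$intercept">
      $coeffs
    </RegressionTable>
  </RegressionModel>
</PMML>"""
  }
}
```

A real implementation would presumably build the document with the JPMML model classes rather than string templates, but the trait-per-model shape matches the "interface plus per-algorithm export methods" structure of the PR description.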
Force-pushed a71640e to 710de52.
@@ -39,6 +40,10 @@ import org.apache.flink.ml.math.SparseVector
 */
object MLUtils {

  val flinkApp = new Application()
flinkApp is ambiguous. I would like to use pmmlApp instead.
Hi @sachingoel0101, thanks for opening this pull request. Great start! I have some comments on your changes.
I would prefer to cover only the PMML interface (…).
The PMML model is quite extensive, and there isn't enough support in the ML library for utilizing most of it (e.g. FieldUsageType, DataTypes, etc.). I had actually written the import functions for both SVM and MLR but decided to drop them. Edit: @chiwanpark I've addressed all your comments. Thanks for the review. :)
Okay, we need some discussion on the mailing list about the ML model import/export feature. I think PMML support is one of the sub-issues related to the broader model import/export issue. I'll post a discussion thread in a few days.
Force-pushed 30ee4e5 to dceabd6.
I suggest that you look at how PMML has been done in Oryx 2.0 (PMML in Spark followed Oryx 2.0). PMML support was discussed various times on the Mahout project and was never implemented, in large part due to the lack of actual PMML usage by machine learning practitioners and data scientists. See this Mahout thread from last year, and more specifically Ted Dunning's comment in the thread: http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E

Given that PMML models can get really huge, it's good practice to persist them in a compressed format. It would also be good to be able to specify which features/fields are categorical/numeric (via a config file, maybe).
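The compressed-persistence suggestion is straightforward with the JDK's built-in gzip streams. A minimal sketch, assuming the model is already rendered to a PMML string (the helper names here are illustrative, not an existing Flink API):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Gzip a PMML document before persisting it; large models
// (e.g. big ensembles) are repetitive XML and compress well.
def compressPMML(pmml: String): Array[Byte] = {
  val buffer = new ByteArrayOutputStream()
  val gzip = new GZIPOutputStream(buffer)
  try gzip.write(pmml.getBytes(StandardCharsets.UTF_8)) finally gzip.close()
  buffer.toByteArray
}

// Inverse operation, for reading a persisted model back.
def decompressPMML(compressed: Array[Byte]): String = {
  val gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))
  try scala.io.Source.fromInputStream(gzip, "UTF-8").mkString finally gzip.close()
}
```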
Force-pushed bf58065 to 4935166.
Force-pushed e4e1ccd to 300d76b.
Hello, any news on this PR? @smarthi PMML is actually an industry standard and is widely used to support model portability in complex infrastructures. Assuming it is not adopted is, in my knowledge and experience, a wrong assumption. There are surely many data scientists who have never come into contact with this standard (I had never heard of it before my first job on an ML architecture), but it's the best (and only) tool for this kind of job.
Hi @chobeat, thanks for pinging this issue. I forgot to send a discussion email to the mailing list. I think we have to discuss the following:
I would like to create a general ML model importing/exporting framework. Then we can easily add PMML support on top of that framework.
Hi @chiwanpark,
I've used PMML extensively in a previous project and saw many application cases beyond my own. PMML export is necessary for external portability: you may need to create a model in Flink and then apply it to local data in a data mining tool, for example, or deploy it in a production pipeline built on a totally different technology stack.
Support for R may be interesting in itself, but I don't understand what you mean. MLlib does support PMML export (even if somewhat buggy for a few models, like Naive Bayes), so it is already possible to move models from MLlib to Flink.
This could be interesting to guarantee the consistency of the models and to tune it to our needs. The complexity of PMML comes from its need for generality and consistency, but it is often overkill for describing simple models. It also has only partial support for many models we may want to implement, e.g. any of the online learning algorithms implemented in SAMOA or other online learning frameworks. I know we still miss a few pieces before reaching that point, but still...
Hi @chobeat, thanks for leaving your comments. About compatibility with other systems (such as R or MLlib), I meant that we cannot achieve compatibility with those systems even if we use PMML, because there are differences between FlinkML and the other systems. For example, FlinkML supports only …
Well, that wouldn't be a problem for export: you will create, and therefore export, only models that have … This would be a problem for import, though, because PMML supports a wider set of data types and model types, and you can't really achieve any satisfying degree of PMML support in a platform like Flink; that's why everyone uses JPMML for evaluation. You would only be able to import compatible models with compatible data fields. This would require a simple validation at runtime on the model type and on the fields' data types.
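The runtime validation described above could be sketched like this. Everything here is hypothetical: `FieldSpec`, `validateImport`, and the whitelisted sets are illustrative placeholders for whatever model types and data types FlinkML actually covers, not any real Flink or JPMML API.

```scala
// Placeholder whitelists: model and field types Flink could represent.
val supportedModels = Set("RegressionModel", "SupportVectorMachineModel")
val supportedDataTypes = Set("double", "float", "integer")

// Minimal stand-in for a PMML DataField (name plus dataType attribute).
case class FieldSpec(name: String, dataType: String)

// Reject a model up front if its type or any field type is unsupported,
// instead of failing later inside a pipeline.
def validateImport(modelType: String,
                   fields: Seq[FieldSpec]): Either[String, Unit] = {
  if (!supportedModels.contains(modelType))
    Left(s"Unsupported model type: $modelType")
  else {
    val bad = fields.filterNot(f => supportedDataTypes.contains(f.dataType))
    if (bad.nonEmpty)
      Left(s"Unsupported field types: ${bad.map(_.name).mkString(", ")}")
    else Right(())
  }
}
```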
As the original author of this PR, I'd say this:
That said, just for comparison purposes, Spark has its own model export and import feature, along with PMML export. Hoping to fully support PMML import in a framework like Flink or Spark is next to impossible; it would require changes to the entire way our pipelines and datasets are represented.
I agree with @sachingoel0101 on the import complexity but, from our point of view, Flink is the perfect platform to evaluate models in streaming, and we are using it that way in our architecture. Why do you think it wouldn't be suitable?
That is a good point. In a streaming setting, it does indeed make sense for the model to be available. However, in my opinion, it would then make sense to just use JPMML to import the object, followed by extracting the model parameters. Granted, it is an added effort on the user's side, but I still think it beats the complexity introduced by supporting imports directly. Furthermore, it would be bad design to have to reject valid PMML models just because a minor thing isn't supported in Flink.
@sachingoel0101 I agree. Nonetheless, an easy way to store a model generated in batch and move it to a streaming environment would be a really useful feature, and we come back to what @chiwanpark was saying about a custom format internal to Flink.
I'm all for that. Flink's models should be transferable at least across Flink. But that should be part of a separate PR, and it should not block this one, as it has for far too long.
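As a rough illustration of the Flink-internal format idea discussed above (assumptions: plain JVM serialization of the parameter set; `ModelParams`, `save`, and `load` are hypothetical names, not a Flink API), a model trained in a batch job could be persisted and read back in a streaming job:

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical parameter bundle for a trained linear model.
case class ModelParams(weights: Array[Double], intercept: Double)
    extends Serializable

// Write the parameters to disk at the end of a batch training job.
def save(params: ModelParams, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(params) finally out.close()
}

// Read them back when bootstrapping a streaming scoring job.
def load(path: String): ModelParams = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[ModelParams] finally in.close()
}
```

A production design would likely prefer a versioned, language-neutral encoding over raw JVM serialization, but the batch-write/stream-read shape is the point being discussed here.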
Closing since flink-ml is effectively frozen.