
[FLINK-1966][ml]Add support for Predictive Model Markup Language #1186

Closed
wants to merge 1 commit

Conversation

sachingoel0101
Contributor

  1. Adds an interface to allow exporting of models to PMML format.
  2. Implements export methods for the existing SVM and Regression algorithms.

@@ -39,6 +40,10 @@ import org.apache.flink.ml.math.SparseVector
*/
object MLUtils {

val flinkApp = new Application()
Member

flinkApp is ambiguous. I would like to use pmmlApp.

@chiwanpark
Member

Hi @sachingoel0101, thanks for opening this pull request. Great start! I have some comments on it.

  1. Some of the implementation is not very Scala-esque.
  2. There is no PMML import interface.

I would prefer to cover only the PMML interface (the toPMML and fromPMML methods) in this pull request. Covering the implementations of that interface in separate issues would be better.
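For illustration, a minimal sketch of what such an interface could look like, assuming the jpmml-model classes (org.dmg.pmml.PMML, org.jpmml.model.JAXBUtil) are on the classpath; the trait and helper names are hypothetical, not code from this PR:

```scala
import java.io.StringWriter
import javax.xml.transform.stream.StreamResult

import org.dmg.pmml.PMML
import org.jpmml.model.JAXBUtil

// Hypothetical mixin for FlinkML predictors that can describe their trained state as PMML.
trait PMMLExportable {
  /** Build the PMML document describing the trained model. */
  def toPMML(): PMML

  /** Serialize the PMML document to an XML string. */
  def toPMMLString(): String = {
    val writer = new StringWriter()
    JAXBUtil.marshalPMML(toPMML(), new StreamResult(writer))
    writer.toString
  }
}

// Hypothetical counterpart for rebuilding a model from a PMML document,
// returning None when the document cannot be represented in FlinkML.
trait PMMLImportable[Model] {
  def fromPMML(pmml: PMML): Option[Model]
}
```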

@sachingoel0101
Contributor Author

The PMML model is quite extensive, and there isn't enough support in the ML library for utilizing most of its features [like FieldUsageType, DataTypes, etc.]. I had actually written the import functions for both SVM and MLR but decided to drop them.
I mostly followed Spark's implementation for this, and it isn't supported there either.

Edit: @chiwanpark I've addressed all your comments. Thanks for the review. :)

@chiwanpark
Member

Okay, we need some discussion on the mailing list about the ML model import/export feature. I think PMML support is one of the sub-issues related to the ML model import/export issue.

I'll post a discussion thread in a few days.

@smarthi
Member

smarthi commented Oct 29, 2015

Suggest that you see how PMML has been done in Oryx 2.0 (PMML in Spark followed Oryx 2.0). PMML support was discussed various times on the Mahout project and was never implemented, in large part due to the lack of actual PMML usage by machine learning practitioners and data scientists.

See this Mahout thread from last year and, more specifically, Ted Dunning's comment in the thread: http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E

Given that PMML models can get really huge, it's good practice to persist them in a compressed format. It would also be good to be able to specify which features/fields are categorical/numeric (via a config file, maybe).
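For example, a rough sketch of gzip-compressed persistence, assuming the jpmml-model library is available (exact class locations vary between JPMML versions):

```scala
import java.io.{FileInputStream, FileOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import javax.xml.transform.stream.{StreamResult, StreamSource}

import org.dmg.pmml.PMML
import org.jpmml.model.JAXBUtil

object CompressedPmml {
  // Marshal the PMML document through a gzip stream.
  def save(pmml: PMML, path: String): Unit = {
    val out = new GZIPOutputStream(new FileOutputStream(path))
    try JAXBUtil.marshalPMML(pmml, new StreamResult(out))
    finally out.close()
  }

  // Unmarshal a gzip-compressed PMML document.
  def load(path: String): PMML = {
    val in = new GZIPInputStream(new FileInputStream(path))
    try JAXBUtil.unmarshalPMML(new StreamSource(in))
    finally in.close()
  }
}
```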

@chobeat
Contributor

chobeat commented Feb 8, 2016

Hello,

any news on this PR?

@smarthi PMML is actually an industry standard and widely used to support model portability in complex infrastructures. Assuming that it is not adopted is wrong, in my knowledge and experience. There are certainly a lot of data scientists who never come into contact with this standard, and I had never heard of it before my first job on an ML architecture, but it's the best (and only) tool for this kind of job.

@chiwanpark
Member

Hi @chobeat, thanks for pinging this issue. I forgot to send a discussion email to the mailing list. I think we have to discuss the following:

  • What is the main purpose of supporting PMML? Is this feature only for model portability within FlinkML? If not, we have to support other systems such as R or Spark MLlib.
  • What about a FlinkML-only format? I think PMML's support for distributed systems is poor; an XML-based format is hard to parallelize.

I would like to create a general ML model import/export framework. Then we can easily add PMML support on top of that framework.

@chobeat
Contributor

chobeat commented Feb 8, 2016

Hi @chiwanpark,

What is main purpose to support PMML? Is this feature for only model portability in FlinkML?

I've used PMML extensively in a previous project and saw many application cases other than my own. PMML export is necessary for external portability: you may need to create a model in Flink and use it on local data with a data mining tool, for example, or you could deploy it in a production pipeline built on a totally different technology stack.
PMML import is optional though: you can use JPMML (the reference implementation of PMML) to read a PMML file and evaluate the model locally on each node. Importing from PMML into the native FlinkML implementations may be a plus in terms of usability and probably performance, but it's not really a blocking issue for a developer.
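For reference, a rough sketch of such node-local scoring with the JPMML evaluator; the API shown (ModelEvaluatorFactory, InputField.prepare) differs between jpmml-evaluator versions, so treat this as an outline only:

```scala
import java.io.FileInputStream
import javax.xml.transform.stream.StreamSource

import scala.collection.JavaConverters._

import org.dmg.pmml.FieldName
import org.jpmml.evaluator.{Evaluator, ModelEvaluatorFactory}
import org.jpmml.model.JAXBUtil

object LocalPmmlScoring {

  // Build an evaluator from a PMML file. In a Flink job this would typically be
  // done once per task, e.g. in the open() method of a rich function.
  def loadEvaluator(pmmlPath: String): Evaluator = {
    val pmml = JAXBUtil.unmarshalPMML(new StreamSource(new FileInputStream(pmmlPath)))
    val evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml)
    evaluator.verify()
    evaluator
  }

  // Score a single record, given as a map from field name to raw value.
  def score(evaluator: Evaluator, record: Map[String, AnyRef]): java.util.Map[FieldName, _] = {
    val arguments = evaluator.getInputFields.asScala.map { field =>
      field.getName -> field.prepare(record(field.getName.getValue))
    }.toMap
    evaluator.evaluate(arguments.asJava)
  }
}
```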

If not, we have to support other systems such as R or Spark MLlib.

Support for R may be interesting by itself, but I don't understand what you mean. MLlib does support PMML export (even if somewhat buggy for a few models like Naive Bayes), so it is already possible to move models from MLlib to Flink.

What about FlinkML only format? I think that support for distributed system in PMML is poor. XML-based format is hard to parallelize.

This could be interesting to guarantee the consistency of the models and to tune the format to our needs. The complexity of PMML comes from its need for generality and consistency, but it's often overkill for describing simple models. Also, it has only partial support for many models we may want to implement, e.g. any of the online learning algorithms implemented in SAMOA or other online learning frameworks. I know we still miss a few pieces before reaching that point, but still...

@chiwanpark
Member

Hi @chobeat, thanks for leaving your comments.

About compatibility with other systems (such as R or MLlib), I meant that we cannot achieve full compatibility with those systems even if we use PMML, because there are differences between FlinkML and the other systems. For example, FlinkML supports only Double as a data type. So we can achieve only partial support of PMML (especially for importing models from other systems). Is this sufficient for use in production? If yes, we should go for this.

@chobeat
Contributor

chobeat commented Feb 9, 2016

Well, that wouldn't be a problem for export: you will create, and therefore export, only models that have Double as the data type for their parameters, and that's not an issue.

This would be a problem for import though, because PMML supports a wider set of data types and model types, but you can't really achieve any satisfying degree of PMML support in a platform like Flink, which is why everyone uses JPMML for evaluation. You would only be able to import compatible models with compatible data fields. This would require a simple runtime validation of the model type and of the fields' data types.
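A minimal sketch of what that validation could look like, assuming the org.dmg.pmml model classes (package locations vary slightly between JPMML versions) and restricting the check to a single model type for brevity:

```scala
import scala.collection.JavaConverters._

import org.dmg.pmml.{DataType, OpType, PMML}
import org.dmg.pmml.regression.RegressionModel

object PmmlCompatibility {
  // Hypothetical import-time check: accept only models whose data fields are
  // continuous doubles, since FlinkML vectors carry only Double values, and
  // whose model type has a native FlinkML counterpart.
  def isImportable(pmml: PMML): Boolean = {
    val fieldsOk = pmml.getDataDictionary.getDataFields.asScala.forall { field =>
      field.getDataType == DataType.DOUBLE && field.getOpType == OpType.CONTINUOUS
    }
    val modelOk = pmml.getModels.asScala.headOption.exists {
      case _: RegressionModel => true   // e.g. multiple linear regression
      case _                  => false  // everything else would be rejected
    }
    fieldsOk && modelOk
  }
}
```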

@sachingoel0101
Contributor Author

As the original author of this PR, I'd say this:
I tried implementing the import features, but they aren't worth it. You have to discard most valid PMML models because they don't fit in with the Flink framework.
Further, in my opinion, the purpose of Flink is to train the model. Once we export that model in PMML, you can use it pretty much anywhere, say R or MATLAB, which support complete PMML import and export functionality. The exported model is in most cases going to be used for testing, evaluation and prediction purposes, for which Flink isn't a good platform anyway. That can be accomplished anywhere.

@sachingoel0101
Contributor Author

That said, just for comparison, Spark has its own model export and import feature along with PMML export. Hoping to fully support PMML import in a framework like Flink or Spark is next to impossible; it would require changes to the entire way our pipelines and datasets are represented.

@chobeat
Contributor

chobeat commented Feb 9, 2016

I agree with @sachingoel0101 on the import complexity but, from our point of view, Flink is the perfect platform to evaluate models in streaming and we are using it that way in our architecture. Why do you think it wouldn't be suitable?

@sachingoel0101
Contributor Author

That is a good point. In a streaming setting, it does indeed make sense for the model to be available. However, in my opinion, it would then make sense to just use JPMML to import the object and then extract the model parameters. Granted, it is added effort on the user side, but I still think it beats the complexity introduced by supporting imports directly. Furthermore, it would be bad design to have to reject valid PMML models just because some minor thing isn't supported in Flink.
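For illustration, a rough sketch of that approach for a linear model; the RegressionModel accessors come from the org.dmg.pmml classes, whose packages and signatures vary by JPMML version, and the feature ordering is assumed to match training:

```scala
import scala.collection.JavaConverters._

import org.apache.flink.ml.math.DenseVector
import org.dmg.pmml.PMML
import org.dmg.pmml.regression.RegressionModel

object PmmlWeightExtraction {
  // Pull the weight vector and intercept out of a parsed PMML regression model
  // so they can be handed to a FlinkML predictor; error handling is omitted.
  def extractLinearModel(pmml: PMML): Option[(DenseVector, Double)] = {
    pmml.getModels.asScala.collectFirst { case m: RegressionModel => m }.map { model =>
      val table = model.getRegressionTables.get(0)
      val weights = table.getNumericPredictors.asScala
        .map(_.getCoefficient.doubleValue())
        .toArray
      (DenseVector(weights), table.getIntercept)
    }
  }
}
```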

@chobeat
Contributor

chobeat commented Feb 9, 2016

@sachingoel0101 I agree. Nonetheless, an easy way to store a model generated in batch and move it to a streaming environment would be a really useful feature, and we are back to what @chiwanpark was saying about a custom format internal to Flink.

@sachingoel0101
Contributor Author

I'm all for that. Flink's models should be transferable at least across Flink. But that should be part of a separate PR, and not block this one, which has been open for far too long.
It should be pretty easy to accomplish, especially for serializable models. Otherwise we can serialize them ourselves.
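For serializable models, that could be as simple as the following sketch (plain Java serialization of a trained weight vector; the object and method names are illustrative only):

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

import org.apache.flink.ml.math.DenseVector

object ModelPersistence {
  // Write a trained weight vector to disk with plain Java serialization.
  def save(weights: DenseVector, path: String): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(weights)
    finally out.close()
  }

  // Read it back, e.g. inside the open() method of a streaming operator.
  def load(path: String): DenseVector = {
    val in = new ObjectInputStream(new FileInputStream(path))
    try in.readObject().asInstanceOf[DenseVector]
    finally in.close()
  }
}
```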

@zentol
Contributor

zentol commented Feb 28, 2019

Closing since flink-ml is effectively frozen.
