[SPARK-8546] Add PMML export for Naive Bayes #9057
Test build #43515 has finished for PR 9057 at commit
|
@JasmineGeorge Could you make a pass? |
@@ -19,6 +19,8 @@ package org.apache.spark.mllib.classification

import java.lang.{Iterable => JIterable}

import org.apache.spark.mllib.pmml.PMMLExportable
organize imports
@yinxusen Could you update the PR title? |
import org.apache.spark.mllib.classification.{NaiveBayesModel => SNaiveBayesModel}

/**
 * PMML Model Export for GeneralizedLinearModel abstract class
small slip, change the GeneralizedLinearModel to NaiveBayesModel.
Test build #44488 has finished for PR 9057 at commit
|
@JasmineGeorge Please sign off if the changes look good to you:) |
@JasmineGeorge, it would be great if you could add a test to the validator to ensure the exported xml file can be loaded in JPMML and produces the same scores. Please use my latest branch: I renamed the datasets to be generic so that we can use them for different algorithms. For example, iris can be used for both kmeans and multiclass logistic regression. |
Sorry, I can't get to it until next Wednesday. Can someone else take over? |
I will do it, no prob. |
@yinxusen To generate the xml I used this code:
Here is the generated xml model:
If I run the jpmml evaluation I get this exception:
Exception in thread "main" java.lang.NullPointerException
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1838)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at java.lang.Double.valueOf(Double.java:502)
at org.jpmml.evaluator.TypeUtil.parseDouble(TypeUtil.java:136)
at org.jpmml.evaluator.TypeUtil.parse(TypeUtil.java:78)
at org.jpmml.evaluator.FieldValue.parseValue(FieldValue.java:107)
at org.jpmml.evaluator.FieldValue.equalsString(FieldValue.java:54)
at org.jpmml.evaluator.NaiveBayesModelEvaluator.getTargetValueCounts(NaiveBayesModelEvaluator.java:333)
at org.jpmml.evaluator.NaiveBayesModelEvaluator.evaluateClassification(NaiveBayesModelEvaluator.java:154)
at org.jpmml.evaluator.NaiveBayesModelEvaluator.evaluate(NaiveBayesModelEvaluator.java:94)
at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:79)
at org.selvinsource.spark_pmml_exporter_validator.SparkPMMLExporterValidator.evaluate(SparkPMMLExporterValidator.java:219)
at org.selvinsource.spark_pmml_exporter_validator.SparkPMMLExporterValidator.evaluateMultiClassClassificationModelIris(SparkPMMLExporterValidator.java:130)
at org.selvinsource.spark_pmml_exporter_validator.SparkPMMLExporterValidator.main(SparkPMMLExporterValidator.java:94)
I didn't look too much into the exception above (@vruusmann will probably confirm it), but I did spot some potential issues/inconsistencies in the exported xml.
The definition:
<DataField name="target" optype="categorical" dataType="double">
<Value value="0"/>
<Value value="1"/>
<Value value="2"/>
</DataField>
should be changed to
<DataField name="class" optype="categorical" dataType="double">
<Value value="0.0"/>
<Value value="1.0"/>
<Value value="2.0"/>
</DataField>
Consequently
<MiningField name="target" usageType="target"/>
should change to
<MiningField name="class" usageType="predicted"/>
I don't think the items above cause the exception, but it would be nice to align with the conventions used by @JasmineGeorge.
<TargetValueCount value="target_1" count="-0.8808827544295097"/>
should be
<TargetValueCount value="1.0" count="-0.8808827544295097"/>
as target_1 is never defined and it should be 1.0, which is one of the class values.
Please use the branch https://github.com/selvinsource/spark-pmml-exporter-validator/tree/logistic_regression_multi_class to ensure the exported xml produces the correct scoring using jpmml. |
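For anyone who wants to catch the `target_1` kind of mismatch before running the evaluator, here is a minimal sketch of a consistency check using only the JDK's DOM parser. The class name `TargetValueCheck` and the method `undeclaredTargetValues` are hypothetical and are not part of the validator or of JPMML. The check compares attribute strings exactly, which is an assumption on my side and may be stricter than what JPMML itself requires.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the validator or of JPMML.
public class TargetValueCheck {

    /**
     * Returns every TargetValueCount "value" attribute that is not declared
     * as a <Value> of the DataField named targetField.
     */
    public static List<String> undeclaredTargetValues(String pmml, String targetField) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse PMML document", e);
        }

        // Collect the class values declared for the target DataField.
        Set<String> declared = new HashSet<>();
        NodeList fields = doc.getElementsByTagName("DataField");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            if (targetField.equals(field.getAttribute("name"))) {
                NodeList values = field.getElementsByTagName("Value");
                for (int j = 0; j < values.getLength(); j++) {
                    declared.add(((Element) values.item(j)).getAttribute("value"));
                }
            }
        }

        // Every TargetValueCount should refer to one of the declared values.
        List<String> undeclared = new ArrayList<>();
        NodeList counts = doc.getElementsByTagName("TargetValueCount");
        for (int i = 0; i < counts.getLength(); i++) {
            String value = ((Element) counts.item(i)).getAttribute("value");
            if (!declared.contains(value)) {
                undeclared.add(value);
            }
        }
        return undeclared;
    }
}
```

Called on the exported file's contents with the target field name, an empty result means every `TargetValueCount` refers to a declared class value.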
The value of the |
You may want to check out some valid NaiveBayes models. For example, see the following NB model for the popular "Audit" dataset: https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-rattle/src/test/resources/pmml/NaiveBayesAudit.pmml |
@selvinsource I'll check it ASAP. Thanks! |
@selvinsource Sorry for taking so long. I checked the code and the generated XML file carefully. The null pointer is caused by my mistake of turning continuous features into categorical ones. Actually, a naive Bayes model trained with the multinomial distribution should be treated as having continuous features, and we should use
to generate the XML file, rather than the categorical ones. For a model trained in the Bernoulli way, we should treat its features as categorical, i.e. use
|
@yinxusen For multinomial naive Bayes you could still treat the inputs as discrete: according to the documentation they should be term frequencies, and therefore discrete. |
Test build #45725 has finished for PR 9057 at commit
|
@selvinsource @mengxr I modified your PMML export validation code. My current code passes both the Multinomial and Bernoulli cases. However, I am very confused by the PMML definition for the multinomial distribution case. As the PMML Naive Bayes Guide says, there are two kinds of features: categorical and continuous. Since we use In the continuous setting, PMML for Naive Bayes provides two different distributions: the Gaussian distribution and the Poisson distribution. But neither Gaussian nor Poisson fits the multinomial case, because their scoring procedure differs from our multinomial scenario. Currently, I use the Gaussian distribution for continuous features, and use |
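To make the scoring difference concrete, here is a rough sketch of the two scoring rules. It assumes, as I understand MLlib's `NaiveBayesModel`, that `pi` holds log class priors and `theta` holds log conditional probabilities. The class `NaiveBayesScores` and its methods are hypothetical names for illustration and are not the Spark implementation. The multinomial score is linear in the feature counts, which is the part that a per-feature Gaussian or Poisson density does not reproduce.

```java
// Hypothetical illustration of the two scoring rules; not Spark's code.
public class NaiveBayesScores {

    /** Multinomial: score_c = pi_c + sum_j theta_cj * x_j, with x_j a non-negative count. */
    public static double[] multinomial(double[] pi, double[][] theta, double[] x) {
        double[] scores = new double[pi.length];
        for (int c = 0; c < pi.length; c++) {
            double s = pi[c];
            for (int j = 0; j < x.length; j++) {
                s += theta[c][j] * x[j];
            }
            scores[c] = s;
        }
        return scores;
    }

    /** Bernoulli: x_j is 0 or 1; an absent feature contributes log(1 - exp(theta_cj)). */
    public static double[] bernoulli(double[] pi, double[][] theta, double[] x) {
        double[] scores = new double[pi.length];
        for (int c = 0; c < pi.length; c++) {
            double s = pi[c];
            for (int j = 0; j < x.length; j++) {
                if (x[j] != 0.0) {
                    s += theta[c][j];                        // log p(feature present | c)
                } else {
                    s += Math.log1p(-Math.exp(theta[c][j])); // log p(feature absent | c)
                }
            }
            scores[c] = s;
        }
        return scores;
    }

    /** Index of the largest score, i.e. the predicted class. */
    public static int argmax(double[] scores) {
        int best = 0;
        for (int c = 1; c < scores.length; c++) {
            if (scores[c] > scores[best]) {
                best = c;
            }
        }
        return best;
    }
}
```

In the Bernoulli case each feature has exactly two outcomes, so it maps naturally onto categorical inputs with per-value counts. In the multinomial case the count `x_j` multiplies `theta_cj`, and I am not aware of a continuous PMML input that expresses that directly.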
Test build #45855 has finished for PR 9057 at commit
|
@yinxusen I will check out your branch and do some testing as well using the validator. |
@selvinsource Yes, it looks correct and matches what I exported from R (with the pmml and e1071 libraries for naive Bayes). But I am a little worried about the Gaussian distribution that I used in the XML. |
@yinxusen We could start by supporting only Bernoulli and throw an IllegalArgumentException for Multinomial in PMMLModelExportFactory. |
@mengxr What do you think about the PMML export for Multinomial Naive Bayes? |
@mengxr @selvinsource As we discussed, I don't think PMML has good support for multinomial naive Bayes, because we cannot fit a multinomial naive Bayes model into PMML and still get the correct prediction result. I plan to remove the support for multinomial NB here and throw a |
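A minimal sketch of the guard suggested above, for illustration only. `ExportGuard` and `requireBernoulli` are hypothetical names, and the model type strings "bernoulli" and "multinomial" are an assumption about how the model reports its type. The actual check would live in PMMLModelExportFactory and may look different.

```java
// Hypothetical stand-in for a check inside the export factory.
public class ExportGuard {

    /** Rejects any model type other than "bernoulli" (assumed type string). */
    public static void requireBernoulli(String modelType) {
        if (!"bernoulli".equals(modelType)) {
            throw new IllegalArgumentException(
                "PMML export is supported only for Bernoulli naive Bayes, got: " + modelType);
        }
    }
}
```

Failing fast here keeps the exporter from writing a PMML file that loads in JPMML but scores differently from the original model.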
Test build #47326 has finished for PR 9057 at commit
|
retest this please |
Test build #47334 has finished for PR 9057 at commit
|
Test build #47341 has finished for PR 9057 at commit
|
Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it. (This one does seem pretty useful). |
Add PMML export for Naive Bayes, JIRA issue https://issues.apache.org/jira/browse/SPARK-8546