[SPARK-8546] Add PMML export for Naive Bayes #9057

yinxusen · 2015-10-10T04:20:48Z

Add PMML export for Naive Bayes, JIRA issue https://issues.apache.org/jira/browse/SPARK-8546

SparkQA · 2015-10-10T05:09:10Z

Test build #43515 has finished for PR 9057 at commit 8bf481b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2015-10-27T04:52:31Z

@JasmineGeorge Could you make a pass?

mengxr · 2015-10-27T04:53:41Z

mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

@@ -19,6 +19,8 @@ package org.apache.spark.mllib.classification

 import java.lang.{Iterable => JIterable}

+import org.apache.spark.mllib.pmml.PMMLExportable


organize imports

mengxr · 2015-10-27T06:54:41Z

@yinxusen Could you update the PR title? SAPRK is a typo.

JasmineGeorge · 2015-10-27T13:00:35Z

mllib/src/main/scala/org/apache/spark/mllib/pmml/export/NaiveBayesPMMLModelExport.scala

+import org.apache.spark.mllib.classification.{NaiveBayesModel => SNaiveBayesModel}
+
+/**
+ * PMML Model Export for GeneralizedLinearModel abstract class


Small slip: change GeneralizedLinearModel to NaiveBayesModel.

SparkQA · 2015-10-28T05:01:09Z

Test build #44488 has finished for PR 9057 at commit 1a609f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2015-10-28T15:55:46Z

@JasmineGeorge Please sign off if the changes look good to you:)

selvinsource · 2015-10-29T13:00:31Z

@JasmineGeorge, it would be great if you could add a test to the validator to ensure the exported XML file can be loaded in JPMML and produces the same scores.

Please use my latest branch
https://github.com/selvinsource/spark-pmml-exporter-validator/tree/logistic_regression_multi_class

I renamed the datasets to generic names so that we can use them for different algorithms; for example, iris can be used for both k-means and multiclass logistic regression.

JasmineGeorge · 2015-10-29T13:37:10Z

Sorry, I can't get to it until next Wednesday. Can someone else take over?

selvinsource · 2015-10-29T21:59:35Z

I will do it, no prob.

selvinsource · 2015-10-31T09:31:23Z

@yinxusen
If you look at
https://github.com/selvinsource/spark-pmml-exporter-validator/tree/logistic_regression_multi_class
I added a test for your naive bayes export.

To generate the xml I used this code:
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/logistic_regression_multi_class/src/main/resources/spark_shell_exporter/naivebayes_iris.scala

Here is the generated XML model:
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/logistic_regression_multi_class/src/main/resources/exported_pmml_models/naivebayes_classification.xml

If I run the jpmml evaluation I get this exception:
java -jar target/spark-pmml-exporter-validator-1.1.0-SNAPSHOT-jar-with-dependencies.jar NaiveBayesClassificationModel
NaiveBayesClassificationModel selected

Exception in thread "main" java.lang.NullPointerException
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1838)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at java.lang.Double.valueOf(Double.java:502)
    at org.jpmml.evaluator.TypeUtil.parseDouble(TypeUtil.java:136)
    at org.jpmml.evaluator.TypeUtil.parse(TypeUtil.java:78)
    at org.jpmml.evaluator.FieldValue.parseValue(FieldValue.java:107)
    at org.jpmml.evaluator.FieldValue.equalsString(FieldValue.java:54)
    at org.jpmml.evaluator.NaiveBayesModelEvaluator.getTargetValueCounts(NaiveBayesModelEvaluator.java:333)
    at org.jpmml.evaluator.NaiveBayesModelEvaluator.evaluateClassification(NaiveBayesModelEvaluator.java:154)
    at org.jpmml.evaluator.NaiveBayesModelEvaluator.evaluate(NaiveBayesModelEvaluator.java:94)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:79)
    at org.selvinsource.spark_pmml_exporter_validator.SparkPMMLExporterValidator.evaluate(SparkPMMLExporterValidator.java:219)
    at org.selvinsource.spark_pmml_exporter_validator.SparkPMMLExporterValidator.evaluateMultiClassClassificationModelIris(SparkPMMLExporterValidator.java:130)
    at org.selvinsource.spark_pmml_exporter_validator.SparkPMMLExporterValidator.main(SparkPMMLExporterValidator.java:94)

I didn't look closely at the exception above (@vruusmann can probably confirm the cause), but I did spot some potential issues and inconsistencies in the exported XML.

The definition:

       <DataField name="target" optype="categorical" dataType="double">
            <Value value="0"/>
            <Value value="1"/>
            <Value value="2"/>
        </DataField>

should be changed to

        <DataField name="class" optype="categorical" dataType="double">
            <Value value="0.0"/>
            <Value value="1.0"/>
            <Value value="2.0"/>
        </DataField>

Consequently

            <MiningField name="target" usageType="target"/>

to

            <MiningField name="class" usageType="predicted"/>

I don't think the differences above cause the exception, but it would be nice to align with the conventions used by @JasmineGeorge.
The following bit, however, could be the cause of the error:

                        <TargetValueCount value="target_1" count="-0.8808827544295097"/>

should be

                        <TargetValueCount value="1.0" count="-0.8808827544295097"/>

because target_1 is never defined; it should be 1.0, which is one of the class values.

Please use the branch https://github.com/selvinsource/spark-pmml-exporter-validator/tree/logistic_regression_multi_class to ensure the exported XML produces the correct scores with JPMML.

vruusmann · 2015-10-31T19:54:20Z

The value of the TargetValueCount@value attribute must equal some valid value of the target DataField element (as defined by DataField/Value@value attribute). For double data type, the equality is defined by method Double#equals(Object). So, it should be perfectly OK to use literal 1.0 in one place and 1 in the other place - they represent the same numeric value after all.
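A minimal sketch of that equality rule, using only `java.lang.Double`. This is not JPMML's code; the class and method names are hypothetical. It parses the declared `Value@value` strings and a candidate `TargetValueCount@value` string as doubles and compares them with `Double#equals(Object)`:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of JPMML or Spark.
final class TargetValueMatch {

    // Returns true if `candidate` parses to a double that equals one of the declared values.
    static boolean matchesDeclaredValue(List<String> declared, String candidate) {
        final Double parsed;
        try {
            parsed = Double.valueOf(candidate);
        } catch (NumberFormatException e) {
            return false; // a string such as "target_1" is not a number at all
        }
        for (String value : declared) {
            // Equality for the double data type is decided by Double#equals(Object).
            if (Double.valueOf(value).equals(parsed)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> declared = Arrays.asList("0", "1", "2");
        System.out.println(matchesDeclaredValue(declared, "1.0"));
        System.out.println(matchesDeclaredValue(declared, "target_1"));
    }
}
```

Under this rule `1` and `1.0` match, while `target_1` cannot match any declared value because it does not parse as a double.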

vruusmann · 2015-10-31T19:56:50Z

You may want to check out some valid NaiveBayes models. For example, see the following NB model for the popular "Audit" dataset: https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-rattle/src/test/resources/pmml/NaiveBayesAudit.pmml

yinxusen · 2015-11-02T02:07:23Z

@selvinsource I'll check it ASAP. Thanks!

yinxusen · 2015-11-11T04:41:44Z

@selvinsource Sorry for taking so long. I checked the code and the generated XML file carefully. The NullPointerException is caused by my mistake of exporting continuous features as categorical ones.

Actually, the features of a Naive Bayes model trained with the multinomial distribution should be treated as continuous, so we should use

    Continuous Input3   i3   mean[i3,t1],variance[i3,t1]   mean[i3,t2],variance[i3,t2]   mean[i3,t3],variance[i3,t3]

to generate the XML file, rather than the categorical form.

For a model trained in the Bernoulli way, we should treat its features as categorical, i.e. use

    Discrete Input2   i21   count[i21,t1]   count[i21,t2]   count[i21,t3]   ...
                      i22   count[i22,t1]   count[i22,t2]   count[i22,t3]   ...
                      i23   count[i23,t1]   count[i23,t2]   count[i23,t3]   ...
                      ...   ...             ...             ...

selvinsource · 2015-11-12T09:24:23Z

@yinxusen for multinomial naive Bayes you could still treat the inputs as discrete: according to the documentation they should be term frequencies, and therefore discrete.
However, if the algorithm allows these to be continuous numbers, then your solution covers both cases.

SparkQA · 2015-11-12T10:12:00Z

Test build #45725 has finished for PR 9057 at commit 7d8fcb7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yinxusen · 2015-11-13T10:31:52Z

@selvinsource @mengxr I modified your PMML export validation code. My current code passes both the Multinomial and Bernoulli cases. However, I am confused by how PMML defines the multinomial distribution case.

As described in the PMML Naive Bayes Guide, there are two kinds of features: categorical and continuous. Since we use LabeledPoint as our input in the multinomial case, I believe we should treat each feature as a continuous input. We could discretize those continuous features into categorical ones, but we cannot do that here because it is hard to estimate the range of each input feature from the limited information in NaiveBayesModel.

In the continuous setting, PMML for Naive Bayes provides two distributions: the Gaussian distribution and the Poisson distribution. But neither fits the multinomial case, because their scoring procedure differs from that of our multinomial model.

Currently, I use the Gaussian distribution for continuous features, with 1.0 as a pseudo variance. But I am not sure this is correct.
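To make the concern concrete, here is a rough sketch comparing the two scoring rules. The multinomial score is the one MLlib's NaiveBayesModel uses (log prior plus the dot product of the log-probabilities with the feature vector); the Gaussian score is the log of the density PMML defines for continuous inputs. Everything else is an assumption for illustration: the class name, the toy numbers, and in particular the choice to reuse the log-probabilities as the Gaussian means, since the thread does not say which mean values were exported.

```java
// Hypothetical illustration; names and numbers are not taken from the PR.
final class NaiveBayesScoreSketch {

    // Multinomial rule: argmax over classes of logPrior[c] + sum_i logTheta[c][i] * x[i].
    static int predictMultinomial(double[] logPriors, double[][] logTheta, double[] x) {
        double[] scores = new double[logPriors.length];
        for (int c = 0; c < scores.length; c++) {
            scores[c] = logPriors[c];
            for (int i = 0; i < x.length; i++) {
                scores[c] += logTheta[c][i] * x[i];
            }
        }
        return argMax(scores);
    }

    // Gaussian rule: argmax of logPrior[c] + sum_i log N(x[i]; means[c][i], variance).
    static int predictGaussian(double[] logPriors, double[][] means, double variance, double[] x) {
        double[] scores = new double[logPriors.length];
        for (int c = 0; c < scores.length; c++) {
            scores[c] = logPriors[c];
            for (int i = 0; i < x.length; i++) {
                double d = x[i] - means[c][i];
                scores[c] += -0.5 * Math.log(2.0 * Math.PI * variance) - d * d / (2.0 * variance);
            }
        }
        return argMax(scores);
    }

    // Index of the largest score (first one wins on ties).
    static int argMax(double[] scores) {
        int best = 0;
        for (int c = 1; c < scores.length; c++) {
            if (scores[c] > scores[best]) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] logPriors = {Math.log(0.5), Math.log(0.5)};
        double[][] logTheta = {
            {Math.log(0.5), Math.log(0.5)},
            {Math.log(0.9), Math.log(0.1)}
        };
        double[] x = {3.0, 1.0};
        System.out.println("multinomial: " + predictMultinomial(logPriors, logTheta, x));
        // Assumption: the means are set to the log-probabilities, with pseudo variance 1.0.
        System.out.println("gaussian:    " + predictGaussian(logPriors, logTheta, 1.0, x));
    }
}
```

With variance 1.0, the Gaussian log-score expands to the same linear term in x plus a class-dependent constant (minus half the sum of squared means), so the two rules can rank classes differently. On this toy input the multinomial rule picks class 1 and the Gaussian rule picks class 0.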

SparkQA · 2015-11-13T10:40:08Z

Test build #45855 has finished for PR 9057 at commit 4dad4db.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yinxusen · 2015-11-13T16:09:22Z

If you want to see the exported xml of multinomial distribution, click here. For bernoulli case, click here.

selvinsource · 2015-11-13T17:19:02Z

@yinxusen I will check out your branch and do some testing as well using the validator.
From what I can see the exported xml seems correct 👍 .

yinxusen · 2015-11-14T01:08:46Z

@selvinsource Yes, it looks correct and matches what I exported from R (with the pmml and e1071 libraries for naive bayes). But I am a little worried about the Gaussian distribution that I used in the XML.

selvinsource · 2015-11-15T20:48:25Z

@yinxusen
https://github.com/selvinsource/spark-pmml-exporter-validator/tree/logistic_regression_multi_class
I tested both multinomial and bernoulli.
The bernoulli results are good, I used the SPEC Heart dataset.
The multinomial results are not as good: the scores in JPMML differ from the Spark predictions, which confirms your worries.

We could start by supporting only Bernoulli and throw an IllegalArgumentException for Multinomial in PMMLModelExportFactory.
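A minimal sketch of such a guard, written in plain Java for illustration. The real PMMLModelExportFactory is Scala and its structure is not shown in this thread, so the class and method names here are hypothetical. The only thing taken from MLlib is that NaiveBayesModel exposes its model type as the string "multinomial" or "bernoulli":

```java
// Hypothetical sketch; not the actual factory code.
final class NaiveBayesExportGuard {

    // Rejects every model type except "bernoulli" before any PMML is built.
    static void requireBernoulli(String modelType) {
        if (!"bernoulli".equals(modelType)) {
            throw new IllegalArgumentException(
                "PMML export is only supported for Bernoulli Naive Bayes, got model type: "
                    + modelType);
        }
    }

    public static void main(String[] args) {
        requireBernoulli("bernoulli"); // accepted, nothing happens
        try {
            requireBernoulli("multinomial");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast in the factory means an unsupported model never produces a PMML file that scores differently from Spark.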

yinxusen · 2015-11-19T10:13:20Z

@mengxr What do you think about the PMML export for Multinomial Naive Bayes?

yinxusen · 2015-12-08T07:16:33Z

@mengxr @selvinsource As we discussed, I don't think PMML supports multinomial naive bayes well, because we cannot express a multinomial naive bayes model in PMML and still get correct predictions. I plan to remove the support for multinomial NB here and throw an IllegalArgumentException.

SparkQA · 2015-12-08T09:41:46Z

Test build #47326 has finished for PR 9057 at commit 5a89d9d.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

yinxusen · 2015-12-08T11:34:34Z

retest this please

SparkQA · 2015-12-08T11:52:26Z

Test build #47334 has finished for PR 9057 at commit 5a89d9d.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-12-08T17:10:43Z

Test build #47341 has finished for PR 9057 at commit b17491d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * public class JavaSQLTransformerExample
 * final class DecisionTreeClassifier @Since("1.4.0") (
 * final class GBTClassifier @Since("1.4.0") (
 * class LogisticRegression @Since("1.2.0") (
 * class MultilayerPerceptronClassifier @Since("1.5.0") (
 * class NaiveBayes @Since("1.5.0") (
 * final class OneVsRest @Since("1.4.0") (
 * final class RandomForestClassifier @Since("1.4.0") (

rxin · 2016-06-15T22:10:20Z

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it.

(This one does seem pretty useful).

add PMML export for Naive Bayes

8bf481b

mengxr mentioned this pull request Oct 27, 2015

[SPARK-8542][MLlib]PMML export for Decision Trees #7842

Closed

mengxr reviewed Oct 27, 2015

JasmineGeorge reviewed Oct 27, 2015

yinxusen changed the title ~~[SAPRK-8546] Add PMML export for Naive Bayes~~ [SPARK-8546] Add PMML export for Naive Bayes Oct 27, 2015

fix errors

1a609f5

fix errors

dd5224b

yinxusen added 4 commits November 11, 2015 16:17

fix multi-normial dist naive bayes

e1295aa

Merge branch 'master' into SPARK-8546

82ee1c2

fix bernulli model

3eb227f

fix style

7d8fcb7

yinxusen added 2 commits November 13, 2015 12:43

Merge branch 'master' of https://github.com/apache/spark into SPARK-8546

24231c3

add output

4dad4db

remove multinomial case

5a89d9d

yinxusen added 2 commits December 8, 2015 22:38

Merge branch 'master' into SPARK-8546

de83bf3

change API with JPMML 1.2.7

b17491d

asfgit closed this in 1a33f2e Jun 15, 2016
