Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8546] Add PMML export for Naive Bayes #9057

Closed
wants to merge 12 commits into from

Conversation

yinxusen
Copy link
Contributor

Add PMML export for Naive Bayes, JIRA issue https://issues.apache.org/jira/browse/SPARK-8546

@SparkQA
Copy link

SparkQA commented Oct 10, 2015

Test build #43515 has finished for PR 9057 at commit 8bf481b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Oct 27, 2015

@JasmineGeorge Could you make a pass?

@@ -19,6 +19,8 @@ package org.apache.spark.mllib.classification

import java.lang.{Iterable => JIterable}

import org.apache.spark.mllib.pmml.PMMLExportable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

organize imports

@mengxr
Copy link
Contributor

mengxr commented Oct 27, 2015

@yinxusen Could you update the PR title? SAPRK is a typo.

import org.apache.spark.mllib.classification.{NaiveBayesModel => SNaiveBayesModel}

/**
* PMML Model Export for GeneralizedLinearModel abstract class

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small slip, change the GeneralizedLinearModel to NaiveBayesModel.

@yinxusen yinxusen changed the title [SAPRK-8546] Add PMML export for Naive Bayes [SPARK-8546] Add PMML export for Naive Bayes Oct 27, 2015
@SparkQA
Copy link

SparkQA commented Oct 28, 2015

Test build #44488 has finished for PR 9057 at commit 1a609f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Oct 28, 2015

@JasmineGeorge Please sign off if the changes look good to you:)

@selvinsource
Copy link
Contributor

@JasmineGeorge, it would be great if you can add a test for the validator to ensure the exported xml file can be loaded in JPMML and score the same results.

Please use my latest branch
https://github.com/selvinsource/spark-pmml-exporter-validator/tree/logistic_regression_multi_class

I renamed the datasets' names to be generic so that we can use them for different algorithms for example iris can be used for both kmeans and multiclass logistic regression.

@JasmineGeorge
Copy link

Sorry I can't get to it until next Wednesday.. Can someone else take over

@selvinsource
Copy link
Contributor

I will do it, no prob.

@selvinsource
Copy link
Contributor

@yinxusen
If you look at
https://github.com/selvinsource/spark-pmml-exporter-validator/tree/logistic_regression_multi_class
I added a test for your naive bayes export.

To generate the xml I used this code:
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/logistic_regression_multi_class/src/main/resources/spark_shell_exporter/naivebayes_iris.scala

Here the xml model generated:
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/logistic_regression_multi_class/src/main/resources/exported_pmml_models/naivebayes_classification.xml

If I run the jpmml evaluation I get this exception:
java -jar target/spark-pmml-exporter-validator-1.1.0-SNAPSHOT-jar-with-dependencies.jar NaiveBayesClassificationModel
NaiveBayesClassificationModel selected

Exception in thread "main" java.lang.NullPointerException
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1838)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at java.lang.Double.valueOf(Double.java:502)
    at org.jpmml.evaluator.TypeUtil.parseDouble(TypeUtil.java:136)
    at org.jpmml.evaluator.TypeUtil.parse(TypeUtil.java:78)
    at org.jpmml.evaluator.FieldValue.parseValue(FieldValue.java:107)
    at org.jpmml.evaluator.FieldValue.equalsString(FieldValue.java:54)
    at org.jpmml.evaluator.NaiveBayesModelEvaluator.getTargetValueCounts(NaiveBayesModelEvaluator.java:333)
    at org.jpmml.evaluator.NaiveBayesModelEvaluator.evaluateClassification(NaiveBayesModelEvaluator.java:154)
    at org.jpmml.evaluator.NaiveBayesModelEvaluator.evaluate(NaiveBayesModelEvaluator.java:94)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:79)
    at org.selvinsource.spark_pmml_exporter_validator.SparkPMMLExporterValidator.evaluate(SparkPMMLExporterValidator.java:219)
    at org.selvinsource.spark_pmml_exporter_validator.SparkPMMLExporterValidator.evaluateMultiClassClassificationModelIris(SparkPMMLExporterValidator.java:130)
    at org.selvinsource.spark_pmml_exporter_validator.SparkPMMLExporterValidator.main(SparkPMMLExporterValidator.java:94)

I didn't look too much into the exception above, @vruusmann will probably confirm it, but I did spot some potential issues/inconsistencies in the xml exported.

The definition:

       <DataField name="target" optype="categorical" dataType="double">
            <Value value="0"/>
            <Value value="1"/>
            <Value value="2"/>
        </DataField>

should be changed to

        <DataField name="class" optype="categorical" dataType="double">
            <Value value="0.0"/>
            <Value value="1.0"/>
            <Value value="2.0"/>
        </DataField>

Consequently

            <MiningField name="target" usageType="target"/>

to

            <MiningField name="class" usageType="predicted"/>

While the above I don't think they cause the exception, but it would be nice to align to the conventions used by @JasmineGeorge,
this following bit could potentially be the cause of the error:

                        <TargetValueCount value="target_1" count="-0.8808827544295097"/>

should be

                        <TargetValueCount value="1.0" count="-0.8808827544295097"/>

as target_1 is never defined and it should be 1.0 which is one of the class values.

Please use the branch https://github.com/selvinsource/spark-pmml-exporter-validator/tree/logistic_regression_multi_class to ensure the exported xml produce the correct scoring using jpmml.

@vruusmann
Copy link
Contributor

The value of the TargetValueCount@value attribute must equal some valid value of the target DataField element (as defined by DataField/Value@value attribute). For double data type, the equality is defined by method Double#equals(Object). So, it should be perfectly OK to use literal 1.0 in one place and 1 in the other place - they represent the same numeric value after all.

@vruusmann
Copy link
Contributor

You may want to check out some valid NaiveBayes models. For example, see the following NB model for the popular "Audit" dataset: https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-rattle/src/test/resources/pmml/NaiveBayesAudit.pmml

@yinxusen
Copy link
Contributor Author

yinxusen commented Nov 2, 2015

@selvinsource I"ll check it ASAP. Thanks!

@yinxusen
Copy link
Contributor Author

@selvinsource Sorry for taking too long a time. I check the code and generated XML file carefully. The null pointer is caused by a mistake that I process continuous features into categorical ones.

Actually, the naive bayes model generated in multinomial distribution should be treated as continuous features, and we should use

Continuous Input3   i3  mean[i3,t1],variance[i3,t1] mean[i3,t2],variance[i3,t2] mean[i3,t3],variance[i3,t3]

to generate the XML file, other than categorical ones.

For model generated in Bernoulli way, we should treat its features categorically. I.e. use

Discrete Input2 i21 count[i21,t1]   count[i21,t2]   count[i21,t3]   ...
i22 count[i22,t1]   count[i22,t2]   count[i22,t3]   ...
i23 count[i23,t1]   count[i23,t2]   count[i23,t3]   ...
... ... ... ...

@selvinsource
Copy link
Contributor

@yinxusen for multinomial naive Bayes you could still use the inputs as discrete as they should be frequency of the terms accordingly to the documentation, therefore discrete.
However if the algorithm allows these to be continous numbers, then you solution covers both cases.

@SparkQA
Copy link

SparkQA commented Nov 12, 2015

Test build #45725 has finished for PR 9057 at commit 7d8fcb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Copy link
Contributor Author

@selvinsource @mengxr I modified your code of pmml export validation. My current code can pass both Multinomial and Bernoulli cases. However, I am very confused by the PMML definition with multinomial distribution case.

As said in the PMML Naive Bayes Guide, we can see that there are two kinds of features - categorical one and continuous one. Since we use LabeledPoint as our input under the multinomial case, I believe that we should treat each feature as a continuous input. Even though we can discretize those continuous features into categorical ones, we cannot do it here because it's hard to estimate the range of every input feature here with the limited knowledge of NaiveBayesModel.

In the continuous setting, PMML for Naive Bayes provides two different distributions - the Gaussian distribution and the Poisson distribution. But neither Gaussian nor Poisson fit the multinominal case, because the scoring procedure is different with our multi-normial scenario.

Currently, I use Gaussian distribution for continuous features, and use 1.0 as a pseudo variance. But I am not sure the correctness.

@SparkQA
Copy link

SparkQA commented Nov 13, 2015

Test build #45855 has finished for PR 9057 at commit 4dad4db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Copy link
Contributor Author

If you want to see the exported xml of multinomial distribution, click here. For bernoulli case, click here.

@selvinsource
Copy link
Contributor

@yinxusen I will check out your branch and do some testing as well using the validator.
From what I can see the exported xml seems correct 👍 .

@yinxusen
Copy link
Contributor Author

@selvinsource Yes I looks correct and the same with what I exported from R (with libraries pmml and e1071 for naive bayes). But I am a little worried about the Gaussian distribution that I used in the XML.

@selvinsource
Copy link
Contributor

@yinxusen
https://github.com/selvinsource/spark-pmml-exporter-validator/tree/logistic_regression_multi_class
I tested both multinomial and bernoulli.
The bernoulli results are good, I used the SPEC Heart dataset.
The multinomial results are not as good, the scores in jpmml differ from the spark predict, this confirms your worries.

We could start supporting only Bernoulli and throw a IllegalArgumentException for Multinomial in PMMLModelExportFactory.

@yinxusen
Copy link
Contributor Author

@mengxr How do you think about the PMML export for Multinomial Naive Bayes?

@yinxusen
Copy link
Contributor Author

yinxusen commented Dec 8, 2015

@mengxr @selvinsource As we talked there, I don't think PMML has good supports for multinomial naive bayes because we cannot fit the model of multinomial naive bayes into PMML with correct prediction result. I plan to remove the support for multinomial NB here and throw a IllegalArgumentException.

@SparkQA
Copy link

SparkQA commented Dec 8, 2015

Test build #47326 has finished for PR 9057 at commit 5a89d9d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Copy link
Contributor Author

yinxusen commented Dec 8, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Dec 8, 2015

Test build #47334 has finished for PR 9057 at commit 5a89d9d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 8, 2015

Test build #47341 has finished for PR 9057 at commit b17491d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public class JavaSQLTransformerExample\n * final class DecisionTreeClassifier @Since(\"1.4.0\") (\n * final class GBTClassifier @Since(\"1.4.0\") (\n * class LogisticRegression @Since(\"1.2.0\") (\n * class MultilayerPerceptronClassifier @Since(\"1.5.0\") (\n * class NaiveBayes @Since(\"1.5.0\") (\n * final class OneVsRest @Since(\"1.4.0\") (\n * final class RandomForestClassifier @Since(\"1.4.0\") (\n

@rxin
Copy link
Contributor

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it.

(This one does seem pretty useful).

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants