[SPARK-4894][mllib] Added Bernoulli option to NaiveBayes model in mllib #4087

leahmcguire · 2015-01-17T16:39:58Z

Added an optional model type parameter for NaiveBayes training. It can be either Multinomial or Bernoulli.

When Bernoulli is given, Bernoulli smoothing is used for fitting and for prediction, as per http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

The default model type is Multinomial, which keeps the original fit and predict behavior.

Added additional testing for Bernoulli and Multinomial models.
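
For readers who do not have the linked reference open, the Bernoulli fit and predict rules it describes can be sketched as follows. This is a summary of the textbook formulation, not a transcription of the patch: the symbols $N_c$ (training documents in class $c$), $N_{c,t}$ (those that contain feature $t$), and the binary feature vector $x$ are notation introduced here, and generalizing the add-one smoothing to an arbitrary $\lambda$ is an assumption about how the existing smoothing parameter is applied.

```latex
% Fitting: smoothed probability that feature t is present in class c.
% With \lambda = 1 this is the add-one form used in the linked reference.
\hat{P}(t \mid c) = \frac{N_{c,t} + \lambda}{N_c + 2\lambda}

% Prediction: every feature contributes, whether present (x_t = 1) or absent (x_t = 0).
\hat{c} = \arg\max_{c} \left[ \log \hat{P}(c)
  + \sum_{t} \Bigl( x_t \log \hat{P}(t \mid c)
  + (1 - x_t) \log \bigl( 1 - \hat{P}(t \mid c) \bigr) \Bigr) \right]
```

The $(1 - x_t)$ term is what separates this from the Multinomial model, which scores only the features that occur.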

AmplabJenkins · 2015-01-17T16:42:09Z

Can one of the admins verify this patch?

rnowling · 2015-01-19T04:41:44Z

mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala


-  {
-    // Need to put an extra pair of braces to prevent Scala treating `i` as a member.
+  def populateMatrix(arrayIn: Array[Array[Double]],


This function seems excessive. Does the Breeze library support element-wise log/exp and addition/subtraction with matrices? If so, that would be cleaner and less verbose.

rnowling · 2015-01-19T05:03:02Z

@leahmcguire,

Thanks for the patch!

A few comments:

- PySpark calls the Scala API for MLlib, so for API compatibility we can't use enumerations on the public APIs. I suggest using a string for the train() functions but keeping the enumeration for the internal API.
- Can you create a new JIRA for updating the PySpark MLlib NB API? I can post details on what needs to change there. If you don't want to do the PR for that, I can.
- The populateMatrix function is verbose. Breeze seems to support element-wise operations (https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet), which might negate the need for the populateMatrix function.
- Can you update the MLlib docs in docs/mllib-naive-bayes.md?

Thanks!
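
The first suggestion above (a string on the public API, an enumeration internally) could look roughly like the sketch below. Everything in it is hypothetical and for illustration only: `ModelTypeSketch`, `parse`, and the stand-in `train` are not names from the patch, and the accepted strings are assumed from the PR description.

```scala
object ModelTypeSketch {
  // Internal representation: a sealed hierarchy, so pattern matches can be checked for exhaustiveness.
  sealed trait ModelType
  case object Multinomial extends ModelType
  case object Bernoulli extends ModelType

  // Boundary conversion: public callers (including PySpark) only ever pass a String.
  def parse(name: String): ModelType = name match {
    case "Multinomial" => Multinomial
    case "Bernoulli"   => Bernoulli
    case other =>
      throw new IllegalArgumentException(s"Unknown model type: $other")
  }

  // Hypothetical stand-in for a train() overload; it only resolves the model type
  // and ignores lambda, to keep the sketch self-contained.
  def train(lambda: Double, modelType: String = "Multinomial"): ModelType =
    parse(modelType)
}
```

Keeping the conversion in one place means the rest of the code can match on the sealed type rather than comparing strings.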

leahmcguire · 2015-01-20T03:00:20Z

Thanks for the comments!

The JIRA for the python API is:
https://issues.apache.org/jira/browse/SPARK-5328

I will get the rest fixed tonight or tomorrow.

mengxr · 2015-01-21T00:03:21Z

ok to test

SparkQA · 2015-01-21T00:07:36Z

Test build #25855 has started for PR 4087 at commit ce73c63.

This patch merges cleanly.

SparkQA · 2015-01-21T00:22:40Z

Test build #25856 has started for PR 4087 at commit 4a3676d.

This patch merges cleanly.

SparkQA · 2015-01-21T00:23:34Z

Test build #25856 has finished for PR 4087 at commit 4a3676d.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-21T00:23:35Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25856/

SparkQA · 2015-01-21T01:16:47Z

Test build #25855 has finished for PR 4087 at commit ce73c63.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-21T01:16:51Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25855/

SparkQA · 2015-01-21T17:47:46Z

Test build #25894 has started for PR 4087 at commit 0313c0c.

This patch merges cleanly.

rnowling · 2015-01-21T17:56:47Z

@leahmcguire The updated patch looks great to me. :)

SparkQA · 2015-01-21T18:52:22Z

Test build #25894 has finished for PR 4087 at commit 0313c0c.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-21T18:52:25Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25894/

SparkQA · 2015-01-26T18:32:48Z

Test build #26099 has started for PR 4087 at commit 76e5b0f.

This patch merges cleanly.

SparkQA · 2015-01-26T19:41:55Z

Test build #26099 has finished for PR 4087 at commit 76e5b0f.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-26T19:41:59Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26099/

jkbradley · 2015-02-26T01:46:41Z

mllib/src/test/scala/org/apache/spark/mllib/classification/NaiveBayesSuite.scala

    val testRDD = sc.parallelize(testData, 2)
    testRDD.cache()

-    val model = NaiveBayes.train(testRDD)
+    val model = NaiveBayes.train(testRDD, 1.0, "Bernoulli") ///!!! this gives same result on both models check the math


Just wondering--- is the bug listed here still happening?

leahmcguire · 2015-02-26

No, this was resolved before the commit. I just forgot to remove the comment.

SparkQA · 2015-02-26T17:18:04Z

Test build #28010 has started for PR 4087 at commit d9477ed.

This patch does not merge cleanly.

jkbradley · 2015-02-26T18:00:09Z

There have been a lot of changes, which must be causing the merge issues. Could you please fix them? (Sorry for the slow review; I'll try to review ASAP once it merges cleanly.)

SparkQA · 2015-02-26T18:58:50Z

Test build #28010 has finished for PR 4087 at commit d9477ed.

This patch fails MiMa tests.
This patch does not merge cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-26T18:58:54Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28010/

AmplabJenkins · 2015-03-17T03:34:54Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28685/

jkbradley · 2015-03-20T22:55:39Z

@leahmcguire It looks like the unclean merge came from the PR earlier today for adding Python save/load. I think rebasing and fixing conflicts should be straightforward. I'll update the save/load for versions ASAP if you can fix the merge issues. Thanks very much!

SparkQA · 2015-03-21T01:03:15Z

Test build #28936 has started for PR 4087 at commit 852a727.

This patch merges cleanly.

SparkQA · 2015-03-21T02:26:33Z

Test build #28936 has finished for PR 4087 at commit 852a727.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Data(
- class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable
- class NaiveBayesModel(Saveable, Loader):

AmplabJenkins · 2015-03-21T02:26:37Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28936/

SparkQA · 2015-03-24T23:13:21Z

Test build #29119 has started for PR 4087 at commit 2224b15.

This patch merges cleanly.

SparkQA · 2015-03-25T00:32:13Z

Test build #29119 has finished for PR 4087 at commit 2224b15.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Data(
- case class Data(
- s" but class priors vector pi had $
- s" but class conditionals array theta had $

AmplabJenkins · 2015-03-25T00:32:17Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29119/

jkbradley · 2015-03-25T19:39:25Z

So... that discussion on the mailing list about enum-like types just keeps going with no decision yet. After speaking with @mengxr, I think it might be best to support only String values for the model type instead of something nicer like an enum. I don't want to make you change that, so would you mind if I sent one more PR to your PR?

jkbradley · 2015-03-25T19:40:37Z

(I was about to merge this, but then this issue came up.) After that adjustment, it should be fine. (And feel free to make this change yourself, but I'm offering to do it since the dev list discussion keeps going back and forth.)

leahmcguire · 2015-03-26T02:04:35Z

Either version is fine. If you have time to make the change tomorrow, go ahead and send the PR. Otherwise I'll have time to make the change on Friday.

jkbradley · 2015-03-27T20:31:01Z

If you have time, I'd really appreciate it---thank you! We can eliminate the special enum-like types entirely and just use String.

SparkQA · 2015-03-28T03:23:18Z

Test build #29336 has started for PR 4087 at commit acb69af.

This patch merges cleanly.

SparkQA · 2015-03-28T04:48:00Z

Test build #29336 has finished for PR 4087 at commit acb69af.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Data(
- case class Data(
- s" but class priors vector pi had $
- s" but class conditionals array theta had $

AmplabJenkins · 2015-03-28T04:48:04Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29336/

jkbradley · 2015-03-30T20:04:48Z

mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

+    if (supportedModelTypes.contains(modelType)) {
+      new NaiveBayes(lambda, modelType).run(input)
+    } else {
+      throw new UnknownError(s"NaiveBayes was created with an unknown ModelType: $modelType")


Can you please use require? Since this is an entry point, the parameter check should throw an IllegalArgumentException (which require does). Elsewhere, in the internals, we can throw UnknownErrors since those errors should never actually happen.

require(supportedModelTypes.contains(modelType), s"NaiveBayes was created with an unknown ModelType: $modelType")
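
A minimal, self-contained sketch of the behavior being asked for is shown below. The object name `RequireSketch`, the helper `checkedModelType`, and the contents of `supportedModelTypes` are assumptions made for illustration; only the `require` call itself mirrors the suggestion above.

```scala
object RequireSketch {
  // Assumed set of accepted values; the real definition lives in the patch.
  val supportedModelTypes: Set[String] = Set("Multinomial", "Bernoulli")

  // Entry-point validation: require throws IllegalArgumentException when the
  // predicate is false, prefixing the message with "requirement failed: ".
  def checkedModelType(modelType: String): String = {
    require(supportedModelTypes.contains(modelType),
      s"NaiveBayes was created with an unknown ModelType: $modelType")
    modelType
  }
}
```

Callers that pass an unsupported string then see an `IllegalArgumentException`, which signals bad input rather than an internal failure.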

jkbradley · 2015-03-30T20:05:51Z

@leahmcguire Thanks for updating the enum type. I just made 2 tiny comments; other than that, it looks fine.

SparkQA · 2015-03-31T01:33:20Z

Test build #29438 has started for PR 4087 at commit f3c8994.

SparkQA · 2015-03-31T02:58:16Z

Test build #29438 has finished for PR 4087 at commit f3c8994.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Data(
- case class Data(
- s" but class priors vector pi had $
- s" but class conditionals array theta had $
This patch does not change any dependencies.

AmplabJenkins · 2015-03-31T02:58:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29438/

jkbradley · 2015-03-31T18:16:23Z

LGTM. Thanks very much for bearing with the issues in getting this in! Merging into master.

- ce73c63: added Bernoulli option to naive bayes model in mllib, added optional model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

rnowling reviewed Jan 19, 2015

- 4a3676d: Updated changes re-comments. Got rid of verbose populateMatrix method. Public api now has string instead of enumeration. Docs are updated."
- 0313c0c: fixed style error in NaiveBayes.scala
- 76e5b0f: removed unnecessary sort from test

jkbradley reviewed Feb 26, 2015

- d9477ed: removed old inaccurate comment from test suite for mllib naive bayes
- 852a727: merged with upstream master

jkbradley and others added 3 commits March 22, 2015 14:14

- 6a8f383: Added new model save/load format 2.0 for NaiveBayesModel after modelType parameter was added. Updated tests. Also updated ModelType enum-like type.
- 9ad89ca: removed old code
- 2224b15: Merge pull request #2 from jkbradley/leahmcguire-master. Added model save/load version to support NaiveBayes ModelType

- acb69af: removed enum type and replaces all modelType parameters with strings

jkbradley reviewed Mar 30, 2015

- f3c8994: changed checks on model type to requires

asfgit closed this in d01a6d8 Mar 31, 2015
