Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4894][mllib] Added Bernoulli option to NaiveBayes model in mllib #4087

Closed
wants to merge 31 commits into from

Conversation

leahmcguire
Copy link
Contributor

Added optional model type parameter for NaiveBayes training. Can be either Multinomial or Bernoulli.

When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction as per: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html.

Default for model is original Multinomial fit and predict.

Added additional testing for Bernoulli and Multinomial models.

…model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
@AmplabJenkins
Copy link

Can one of the admins verify this patch?


{
// Need to put an extra pair of braces to prevent Scala treating `i` as a member.
def populateMatrix(arrayIn: Array[Array[Double]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems excessive. Does the Breeze library support element-wise log/exp and addition/subtraction with matrices? If so, that would be cleaner and less verbose.

@rnowling
Copy link
Contributor

@leahmcguire,

Thanks for the patch!

A few comments:

  1. PySpark calls the Scala API for MLlib, so for API compatibility, we can't use enumerations on the public APIs. I suggest using a string for the train() functions but keeping the enumeration for the internal API.
  2. Can you create a new JIRA for updating the PySpark MLlib NB API? I can post details on what needs to change there -- if you don't want to do the PR for that, I can.
  3. The populateMatrix function is verbose. Breeze seems to support element-wise operations (https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet) which might be negate the need for the populateMatrix function.
  4. Can you update the MLlib docs in docs/mllib-naive-bayes.md ?

Thanks!

@leahmcguire
Copy link
Contributor Author

Thanks for the comments!

The JIRA for the python API is:
https://issues.apache.org/jira/browse/SPARK-5328

I will get the rest fixed tonight or tomorrow.

@mengxr
Copy link
Contributor

mengxr commented Jan 21, 2015

ok to test

@SparkQA
Copy link

SparkQA commented Jan 21, 2015

Test build #25855 has started for PR 4087 at commit ce73c63.

  • This patch merges cleanly.

…. Public api now has string instead of enumeration. Docs are updated."
@SparkQA
Copy link

SparkQA commented Jan 21, 2015

Test build #25856 has started for PR 4087 at commit 4a3676d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 21, 2015

Test build #25856 has finished for PR 4087 at commit 4a3676d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25856/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 21, 2015

Test build #25855 has finished for PR 4087 at commit ce73c63.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25855/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 21, 2015

Test build #25894 has started for PR 4087 at commit 0313c0c.

  • This patch merges cleanly.

@rnowling
Copy link
Contributor

@leahmcguire The updated patch looks great to me. :)

@SparkQA
Copy link

SparkQA commented Jan 21, 2015

Test build #25894 has finished for PR 4087 at commit 0313c0c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25894/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 26, 2015

Test build #26099 has started for PR 4087 at commit 76e5b0f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 26, 2015

Test build #26099 has finished for PR 4087 at commit 76e5b0f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26099/
Test FAILed.

val testRDD = sc.parallelize(testData, 2)
testRDD.cache()

val model = NaiveBayes.train(testRDD)
val model = NaiveBayes.train(testRDD, 1.0, "Bernoulli") ///!!! this gives same result on both models check the math
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering--- is the bug listed here still happening?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this was resolved before the commit. I just forgot to remove the comment

@SparkQA
Copy link

SparkQA commented Feb 26, 2015

Test build #28010 has started for PR 4087 at commit d9477ed.

  • This patch does not merge cleanly.

@jkbradley
Copy link
Member

There have been a lot of changes, which must be causing the merge issues. Could you please fix them? (Sorry for the slow review; I'll try to review ASAP once it merges cleanly.)

@SparkQA
Copy link

SparkQA commented Feb 26, 2015

Test build #28010 has finished for PR 4087 at commit d9477ed.

  • This patch fails MiMa tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28010/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28685/
Test PASSed.

@jkbradley
Copy link
Member

@leahmcguire It looks like the unclean merge came from the PR earlier today for adding Python save/load. I think rebasing and fixing conflicts should be straightforward. I'll update the save/load for versions ASAP if you can fix the merge issues. Thanks very much!

@SparkQA
Copy link

SparkQA commented Mar 21, 2015

Test build #28936 has started for PR 4087 at commit 852a727.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 21, 2015

Test build #28936 has finished for PR 4087 at commit 852a727.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Data(
    • class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable
    • class NaiveBayesModel(Saveable, Loader):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28936/
Test PASSed.

jkbradley and others added 3 commits March 22, 2015 14:14
…ype parameter was added. Updated tests. Also updated ModelType enum-like type.
Added model save/load version to support NaiveBayes ModelType
@SparkQA
Copy link

SparkQA commented Mar 24, 2015

Test build #29119 has started for PR 4087 at commit 2224b15.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29119 has finished for PR 4087 at commit 2224b15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Data(
    • case class Data(
    • s" but class priors vector pi had $
    • s" but class conditionals array theta had $

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29119/
Test PASSed.

@jkbradley
Copy link
Member

So...that discussion on the mailing list about enum-like types just keeps going with no decision yet. Speaking with @mengxr , it might be best to support only String values for the model type instead of something nicer like enum. I don't want to make you change that, so would you mind if I sent 1 more PR to your PR?

@jkbradley
Copy link
Member

(I was about to merge this, but then this issue came up.) After that adjustment, it should be fine. (And feel free to make this change yourself, but I'm offering to do it since the dev list discussion keeps going back and forth.)

@leahmcguire
Copy link
Contributor Author

Either version is fine. If you have time to make the change on tomorrow go
ahead and send the PR. Otherwise I'll have time to make the change on
Friday.

On Wed, Mar 25, 2015 at 12:41 PM, jkbradley notifications@github.com
wrote:

(I was about to merge this, but then this issue came up.) After that
adjustment, it should be fine. (And feel free to make this change yourself,
but I'm offering to do it since the dev list discussion keeps going back
and forth.)


Reply to this email directly or view it on GitHub
#4087 (comment).

@jkbradley
Copy link
Member

If you have time, I'd really appreciate it---thank you! We can eliminate the special enum-like types entirely and just use String.

@SparkQA
Copy link

SparkQA commented Mar 28, 2015

Test build #29336 has started for PR 4087 at commit acb69af.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 28, 2015

Test build #29336 has finished for PR 4087 at commit acb69af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Data(
    • case class Data(
    • s" but class priors vector pi had $
    • s" but class conditionals array theta had $

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29336/
Test PASSed.

if (supportedModelTypes.contains(modelType)) {
new NaiveBayes(lambda, modelType).run(input)
} else {
throw new UnknownError(s"NaiveBayes was created with an unknown ModelType: $modelType")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please use require? Since this is an entry point, the parameter check should throw an IllegalArgumentException (which require does). Elsewhere, in the internals, we can throw UnknownErrors since those errors should never actually happen.

require(supportedModelTypes.contains(modelType), s"NaiveBayes was created with an unknown ModelType: $modelType")

@jkbradley
Copy link
Member

@leahmcguire Thanks for updating the enum type. I just made 2 tiny comments; other than that, it looks fine.

@SparkQA
Copy link

SparkQA commented Mar 31, 2015

Test build #29438 has started for PR 4087 at commit f3c8994.

@SparkQA
Copy link

SparkQA commented Mar 31, 2015

Test build #29438 has finished for PR 4087 at commit f3c8994.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Data(
    • case class Data(
    • s" but class priors vector pi had $
    • s" but class conditionals array theta had $
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29438/
Test PASSed.

@jkbradley
Copy link
Member

LGTM Thanks very much for bearing with the issues in getting this in! Merging into master

@asfgit asfgit closed this in d01a6d8 Mar 31, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants