Skip to content

[SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes#15826

Closed
yanboliang wants to merge 4 commits intoapache:masterfrom
yanboliang:spark-14077-2
Closed

[SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes#15826
yanboliang wants to merge 4 commits intoapache:masterfrom
yanboliang:spark-14077-2

Conversation

@yanboliang
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

  • Refactor out trainWithLabelCheck and make mllib.NaiveBayes call into it.
  • Avoid capturing the outer object for modelType.
  • Move requireNonnegativeValues and requireZeroOneBernoulliValues to companion object.

How was this patch tested?

Existing tests.

@yanboliang yanboliang changed the title [SPARK-14077][ML][FOLLOW-UP] Minor cleanups for NaiveBayes [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes Nov 9, 2016
yanboliang referenced this pull request Nov 9, 2016
…te extra pass on data

## What changes were proposed in this pull request?
[SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077) copied the ```NaiveBayes``` implementation from mllib to ml and left mllib as a wrapper. However, there are some difference between mllib and ml to handle labels:
* mllib allow input labels as {-1, +1}, however, ml assumes the input labels in range [0, numClasses).
* mllib ```NaiveBayesModel``` expose ```labels``` but ml did not due to the assumption mention above.

During the copy in [SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077), we use
```val labels = data.map(_.label).distinct().collect().sorted```
to get the distinct labels firstly, and then encode the labels for training. It involves extra Spark job compared with the original implementation. Since ```NaiveBayes``` only do one pass aggregation during training, adding another one seems less efficient. We can get the labels in a single pass along with ```NaiveBayes``` training and send them to MLlib side.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15402 from yanboliang/spark-17835.
private[spark] val supportedModelTypes = Set(Multinomial, Bernoulli)
private[classification] val supportedModelTypes = Set(Multinomial, Bernoulli)

private[NaiveBayes] val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thunterdb I define these two functions still by val which should be more appropriate, since there is difference between val and def to define a function. http://stackoverflow.com/questions/18887264/what-is-the-difference-between-def-and-val-to-define-a-function

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanboliang indeed it makes a difference when defining methods inside another method. In the current case, you brought the function as a value of the companion object. You can safely turn it into a method

@yanboliang
Copy link
Copy Markdown
Contributor Author

cc @thunterdb @zhengruifeng

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 9, 2016

Test build #68398 has finished for PR 15826 at commit 48f7e86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Copy Markdown
Contributor

@thunterdb thunterdb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanboliang thanks! I have a few small comments.

private[spark] val supportedModelTypes = Set(Multinomial, Bernoulli)
private[classification] val supportedModelTypes = Set(Multinomial, Bernoulli)

private[NaiveBayes] val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanboliang indeed it makes a difference when defining methods inside another method. In the current case, you brought the function as a value of the companion object. You can safely turn it into a method

s"Naive Bayes requires nonnegative feature values but found $v.")
}

private[NaiveBayes] val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing here

if (isML) {
private[spark] def trainWithLabelCheck(
dataset: Dataset[_],
positiveLabel: Boolean = true): NaiveBayesModel = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is an internal method, I suggest to not use a default argument and specify it explicitly above.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 10, 2016

Test build #68443 has finished for PR 15826 at commit 89d0c77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Copy Markdown
Contributor Author

@thunterdb Any more comments? Thanks!

@thunterdb
Copy link
Copy Markdown
Contributor

@yanboliang that looks great, thank you. LGTM.

@yanboliang
Copy link
Copy Markdown
Contributor Author

Merged into master and branch-2.1. Thanks for reviewing.

@asfgit asfgit closed this in 22cb3a0 Nov 12, 2016
asfgit pushed a commit that referenced this pull request Nov 12, 2016
## What changes were proposed in this pull request?
* Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes``` call into it.
* Avoid capturing the outer object for ```modelType```.
* Move ```requireNonnegativeValues``` and ```requireZeroOneBernoulliValues``` to companion object.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15826 from yanboliang/spark-14077-2.

(cherry picked from commit 22cb3a0)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
@yanboliang yanboliang deleted the spark-14077-2 branch November 12, 2016 14:21
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
* Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes``` call into it.
* Avoid capturing the outer object for ```modelType```.
* Move ```requireNonnegativeValues``` and ```requireZeroOneBernoulliValues``` to companion object.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#15826 from yanboliang/spark-14077-2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants